Compositions and methods for glycated consumables

ABSTRACT

The method for protein selection can include: characterizing a protein set, training a prediction model, determining target characteristic values, and determining a candidate protein set based on the target characteristic values.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/297,966 filed 10 Jan. 2022, U.S. Provisional Application Ser. No. 63/298,920 filed 12 Jan. 2022, U.S. Provisional Application Ser. No. 63/298,927 filed 12 Jan. 2022, and U.S. Provisional Application Ser. No. 63/298,930 filed 12 Jan. 2022, each of which is incorporated in its entirety by this reference.

TECHNICAL FIELD

This invention relates generally to the food science field, and more specifically to a new and useful system and method in the food science field.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a schematic representation of a variant of the method.

FIG. 2 is a schematic representation of a variant of the system.

FIG. 3 depicts an illustrative example of a database.

FIG. 4 depicts an illustrative example of functional property value sets associated with different source components and constituent proteins.

FIGS. 5A and 5B depict illustrative examples of aggregating feature values for a protein set.

FIG. 6 depicts an embodiment of training a prediction model.

FIG. 7A depicts a first example of training a prediction model to predict functional property values.

FIG. 7B depicts a second example of training a prediction model to predict functional property values.

FIG. 8 depicts an example of determining a candidate protein set.

FIG. 9 depicts an illustrative example of determining a candidate protein set.

FIG. 10 depicts an embodiment of target determination.

FIG. 11 depicts another embodiment of target determination.

FIG. 12 depicts an example of predicting the functional properties for a protein set and optionally predicting a protein set or protein source set.

FIG. 13 depicts an example of predicting the functional properties for a protein set.

FIG. 14 depicts example functional property values for samples produced using phosphorylated proteins.

DETAILED DESCRIPTION

The following description of the embodiments of the invention is not intended to limit the invention to these embodiments, but rather to enable any person skilled in the art to make and use this invention.

1. Overview

As shown in FIG. 1, the method can include: characterizing a protein set S100, training a prediction model S300, determining target characteristic values S400, determining a candidate protein set based on the target characteristic values S500, and/or any other suitable steps.

In variants, the method can function to determine a candidate protein set with a desired set of functional property values (e.g., wherein the candidate protein set can be used in a replacement for a target food product). For example, the candidate protein set can be selected to replicate target functional property values of and/or replace: caseins, leather proteins (e.g., collagen, gelatin, etc.), meat proteins (e.g., myosin), and/or any other protein set. In variants, the method can optionally determine protein source sets that contain the candidate protein set.

2. Examples

In an example, the method can include: predicting functional property values given a protein set and optionally a context (e.g., example shown in FIG. 14). In an illustrative example, the method can include: extracting feature values from the amino acid sequences for each of a set of protein sets, measuring functional property values for the set of protein sets, and training a prediction model to predict functional property values for a protein set based on feature values for the respective protein set. In a specific example, the prediction model can predict functional property values for the protein set based on aggregated feature values across individual proteins in the protein set. A protein set can optionally be associated with a composition (e.g., a relative and/or absolute concentration for each protein in the training protein set) and/or a context (e.g., manufacturing process parameters, protein modifications, etc.), wherein the composition and/or context can be inputs to the prediction model (e.g., separate vectors, concatenated to the protein set feature vector, used to weight the protein set feature vector, etc.).
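
As a minimal sketch of one way per-protein feature values could be aggregated into a single protein set feature vector, the following assumes a concentration-weighted mean. The function name, feature values, and concentrations are hypothetical illustrations and are not taken from this disclosure; other aggregations (e.g., concatenation, unweighted mean) could be substituted.

```python
from typing import Dict, List


def aggregate_features(
    protein_features: Dict[str, List[float]],
    composition: Dict[str, float],
) -> List[float]:
    """Return the concentration-weighted mean of per-protein feature vectors.

    protein_features maps a protein identifier to its feature vector.
    composition maps the same identifiers to a concentration (any unit);
    concentrations are normalized so the weights sum to 1.
    """
    total = sum(composition[name] for name in protein_features)
    if total <= 0:
        raise ValueError("total concentration must be positive")
    n_features = len(next(iter(protein_features.values())))
    aggregated = [0.0] * n_features
    for name, features in protein_features.items():
        weight = composition[name] / total
        for i, value in enumerate(features):
            aggregated[i] += weight * value
    return aggregated


# Hypothetical two-protein set with two features per protein.
features = {"P1": [0.2, 10.0], "P2": [0.6, 30.0]}
composition = {"P1": 3.0, "P2": 1.0}  # relative concentrations (assumed)
protein_set_vector = aggregate_features(features, composition)
```

In this sketch the composition is used to weight the feature vectors; it could instead be passed to the prediction model as a separate input.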

In variants, measuring functional property values for a protein set can include manufacturing a sample matching the protein set composition using the process parameters and/or other context information, wherein the functional property values for the protein set are measured using assays. In a specific example, the target functional property values can be directly measured for a target product (e.g., a target food product).

In variants, the prediction model can be used to predict functional property values for each protein set in a candidate group, wherein the candidate group includes uncharacterized protein sets (e.g., without measured functional property data). A candidate protein set (e.g., including an associated composition and/or context) and/or a protein source with a high probability of producing the candidate protein set can then be selected from the candidate group based on a similarity between the predicted functional property values and target functional property values. Additionally or alternatively, a candidate protein set can be extracted from the prediction model (e.g., using an acquisition function), be predicted by a second model (e.g., a decoder), and/or otherwise determined.
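
The similarity-based selection described above can be sketched as follows, assuming that similarity is measured as Euclidean distance between predicted and target functional property vectors. The distance metric, function names, and all numeric values are assumptions made for illustration only.

```python
import math
from typing import Dict, List


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length property vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidate(
    predicted: Dict[str, List[float]], target: List[float]
) -> str:
    """Return the protein set whose predicted values are closest to the target."""
    return min(predicted, key=lambda name: euclidean_distance(predicted[name], target))


# Hypothetical predicted functional property values for three candidate sets.
predicted_values = {
    "set_A": [0.8, 0.1],
    "set_B": [0.5, 0.4],
    "set_C": [0.2, 0.9],
}
target_values = [0.55, 0.45]  # hypothetical target functional property values
best = select_candidate(predicted_values, target_values)
```

In practice the property values would likely need to be scaled to comparable ranges before a distance is computed, since functional properties can be measured in different units.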

3. Technical Advantages

Variants of the technology can confer one or more advantages over conventional technologies.

First, previous protein selection methodologies (e.g., to identify replacements for dairy and/or meat proteins) relied heavily on domain knowledge, previously researched protein alternatives, and laborious manual testing. Variants of the technology can utilize a computational approach to explore the extremely large and under-investigated protein space to identify candidate proteins that would not have otherwise been identified. For example, variants of the method can identify protein replacements based on the similarities between the amino acid sequence features (AA sequence features) of the candidate proteins and the target proteins (proteins to be replaced), and/or based on similarities between the predicted functional properties of the candidate proteins and the functional properties of the target product (e.g., food).

Second, variants of the technology can use a subset of features (e.g., subset of amino acid sequence features) which are likely to be important in influencing functional behavior. In a specific example, the functional property values are experimentally determined for protein sets (e.g., gelled mixtures of proteins) to capture important protein-protein interactions influencing function, and correlated with the feature values for the constituent proteins, wherein predictive features are selected for subsequent analysis based on the correlation. In a second specific example, lift analysis can be used (e.g., during and/or after training a prediction model) to select a subset of features with high lift. This feature selection can reduce computational complexity and/or enable human-interpretable annotation of the features.
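
One way the correlation-based feature selection in the first specific example could be sketched is shown below, assuming a Pearson correlation and an absolute-value cutoff. The cutoff of 0.8, the feature names, and the measured values are hypothetical; the disclosure does not specify a particular correlation measure or threshold.

```python
import math
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def select_features(
    feature_columns: Dict[str, List[float]],
    property_values: List[float],
    min_abs_corr: float = 0.8,  # assumed cutoff, not from the disclosure
) -> List[str]:
    """Keep features whose absolute correlation with the property meets the cutoff."""
    return [
        name
        for name, column in feature_columns.items()
        if abs(pearson(column, property_values)) >= min_abs_corr
    ]


# Hypothetical feature values for four protein sets and one measured property.
feature_columns = {
    "hydrophobicity": [0.1, 0.2, 0.3, 0.4],
    "net_charge": [1.0, -1.0, 1.0, -1.0],
}
gel_strength = [1.0, 2.1, 2.9, 4.2]
selected = select_features(feature_columns, gel_strength)
```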

Third, variants of the technology can reduce the need for experimental analysis of proteins to determine their candidacy potential. In an example, a large domain of available protein sets can be computationally analyzed (e.g., using featurization of their amino acid sequences) rather than experimentally analyzed to evaluate their potential to replicate functional properties of a target set of proteins. This analysis methodology can enable a much larger group of candidates to be considered than if experimental analysis of each protein set were required.

Fourth, variants of the technology can reduce the need for experimental analysis of potential protein sources by predicting whether a protein source (e.g., plant, plant component, etc.) will include sufficient amounts of a given protein or protein set, such as by using genetic analyses and/or evolutionary tree analyses.

However, further advantages can be provided by the system and method disclosed herein.

4. System

Variants of the system can include a database and a set of models. The system functions to determine the functional properties for protein sets, determine which protein sets can produce a set of target functional properties, determine which protein sources can produce a target protein set, and/or be otherwise used.

An example of the system, including a database, is shown in FIG. 2. An example of the database is shown in FIG. 3. The database can include proteins, protein sets (e.g., protein set identifiers), protein set compositions (e.g., identification of proteins in the set, relative and/or absolute concentrations of proteins in the set, etc.), sequences, features, feature values, functional properties, functional property values, protein sources and/or source components, evolutionary relationships, contexts (e.g., process parameters, protein modifications, sample environment, etc.), and/or any other elements. The system can optionally include and/or interface with one or more third-party databases (e.g., a sequence database, a protein database, amino acid composition database, etc.). In a first example, elements stored in the system database can be retrieved from a third-party database. In a second example, the system database can be a third-party database.

A protein set can be an individual protein (e.g., a set of one, an individual protein within a larger set, etc.), multiple proteins (e.g., a mixture of proteins, proteins within a source and/or source component; within a gel, sample, product, solution, combination, and/or other mixture; within a food product; within a consumer product; etc.), a set of protein sets, and/or be otherwise defined.

The protein set can be from one or more protein sources (e.g., combination of protein sources), from one or more components of protein sources, be manually specified, and/or be otherwise determined. The protein source can be plant matter (e.g., processed and/or unprocessed plant matter), animal matter (e.g., milk such as cow milk, insects such as Acheta domesticus, meat, etc.), bacterium (e.g., naturally occurring, genetically modified, etc.), any organism (e.g., identified by a species name, a common name, etc.), a food product, a naturally-occurring protein source, a synthetic protein source, and/or any other entity and/or component (e.g., protein source component) thereof. The protein source component (e.g., the part of the source where the protein set can be derived) can be a nut, fruit, seed, legumes, stem, leaves, root, flower, stamen, muscle, carapace, and/or any other component of the associated source. The protein source can optionally be labeled (e.g., in the database) with one or more classifications (e.g., dairy, meat, non-dairy, non-meat, etc.). The protein source, source component, and/or the protein set can optionally be associated with an abundance metric (e.g., where the metric can assess the ease of accessing large quantities of the protein set for scaled use). The abundance metric can be: experimentally determined (e.g., measured), predicted (e.g., based on the abundance metrics for related protein sources), and/or otherwise determined. The abundance metric is preferably representative of a single protein's abundance within a protein source, but can alternatively be representative of a protein set's abundance within a protein source, be representative of the protein source's abundance, and/or represent other information.

The protein set can include all or a subset of proteins in the protein source and/or protein source component. In a first example, the protein set can include proteins above a concentration threshold in the protein source and/or source component (e.g., wherein the concentration threshold by weight can be 0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%, 15%, 20%, 25%, 50%, etc.). In a second example, the protein set can include the most abundant (e.g., highest concentration) proteins in a protein source and/or a component of the protein source. In a specific example, the protein set can include a predetermined number of the most abundant proteins (e.g., wherein the predetermined number can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 50, etc.).
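
The two selection rules above (a concentration threshold, or a predetermined number of the most abundant proteins) can be sketched as follows. The protein identifiers and weight percentages are hypothetical placeholders; the 5% threshold is one of the example values listed above.

```python
from typing import Dict, List


def proteins_above_threshold(
    concentrations: Dict[str, float], threshold_pct: float
) -> List[str]:
    """Proteins whose concentration (weight %) exceeds the threshold, most abundant first."""
    kept = [name for name, pct in concentrations.items() if pct > threshold_pct]
    return sorted(kept, key=lambda name: concentrations[name], reverse=True)


def most_abundant(concentrations: Dict[str, float], count: int) -> List[str]:
    """The `count` highest-concentration proteins, most abundant first."""
    ranked = sorted(concentrations, key=lambda name: concentrations[name], reverse=True)
    return ranked[:count]


# Hypothetical weight percentages of proteins in a protein source component.
source_component = {"P1": 45.0, "P2": 30.0, "P3": 2.0, "P4": 8.0}

set_by_threshold = proteins_above_threshold(source_component, threshold_pct=5.0)
set_by_rank = most_abundant(source_component, count=2)
```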

Plant matter can include: peas (e.g., pea flour, pea starch, etc.), rice (e.g., rice flour, glutinous rice flour, white rice flour, brown rice flour, etc.), fruits (e.g., citrus fiber), cassava (e.g., cassava flour), potato, cocoa beans, truffles, olives, coconut flesh, grape pomace, pumpkin (e.g., pumpkin seed), cottonseed, canola, sunflower, hazelnut, pistachio, almond, walnut, crude walnut, cashew, brazil nuts, hazelnut, macadamia nuts, pecan, peanut, hemp, oat, rice, poppy, watermelon (e.g., watermelon seed), chestnut, chia, flax, quinoa, soybean, split mung beans, aquafaba, lupini, fenugreek, kiwi, Sichuan pepper, mustard, sesame, sunflower seeds, algae, duckweeds (e.g., lemna), squash, chickpeas, pine nuts, peas, cassava, citrus (e.g., citrus fiber), fava bean (e.g., fava bean flower), grape (e.g., grape pomace), lima bean (e.g., lima bean paste), carrageenan; plants selected from the cucurbita, anacardium, cannabis, salvia, arachis, brassica, sesamum, legume, and/or other genuses; plants selected from the Anacardiaceae, Asteraceae, Leguminosae, Cucurbits, Rosaceae, Lamiaceae, and/or other family; a combination thereof, and/or any other plant matter. The plant matter may include major production oilseeds (e.g., soybean, rapeseed, sunflower, sesame, niger, castor, canola, cottonseed, etc.), minor production oilseeds (coconut, palm seed, pumpkin, etc.), and/or other crops or plant matter. The plant matter may exclude allergens (e.g., wheat, soy, peanut, etc.). The plant matter may include a single variety of plant matter, a mixture of various plant matter, include animal matter (e.g., insect matter, mammalian products, etc.), and/or include matter from any other source.

The protein source can be processed (e.g., lipid-removed, comminuted, separated into a solid and liquid component, mechanical processing, chemical processing, a protein powder derived from the plant matter, an extract from the plant matter, fermented, protein modifications, etc.) and/or unprocessed.

For example, the protein source can include a plant milk, powdered whole plant component (e.g., plant matter) in an aqueous solution, isolated plant protein (e.g., powder), and/or any other suitable source of protein.

One or more proteins can be derived (e.g., extracted) from the protein source. The proteins can include protein isolates (e.g., solubilized protein isolates) extracted from the protein source. Protein isolates can include: proteins isolated using isoelectric precipitation (e.g., salting in, salting out, etc.), by collecting and optionally diluting a protein-rich solution (e.g., the supernatant obtained by spinning down a whole plant ingredient, such as a seed powder; residual obtained by removing at least a threshold proportion of insoluble solids from a plant milk, such as 50%, 75%, 80%, 90%, etc.; etc.), and/or otherwise obtained. The protein ingredient obtained from the plant matter can be substantially pure (e.g., wherein a single monomeric or multimeric protein represents at least 50%, 60%, 70%, 80%, 90%, and/or more than 90% of the overall protein content in the protein ingredient and/or the product), but can alternatively be impure (e.g., include more than 10%, 20%, 30%, 40%, 50%, 60% other proteins, etc.).

The proteins can include structured protein isolates (SPIs) produced using protein isolates. In a first example, SPIs can be produced by: obtaining a protein isolate mixture (e.g., a protein isolate solution) from a protein source; diluting the protein isolate mixture using a diluent; optionally separating the diluted protein isolate mixture (e.g., allowing sedimentation to occur, centrifuging, filtering, etc.); and collecting SPIs (e.g., an SPI mixture) from the diluted protein isolate mixture (e.g., collecting the sediment, collecting all or part of a homogenous diluted lipid protein isolate mixture, etc.). The diluent can include water (e.g., deionized water), an aqueous solution (e.g., water, a mixture of water and other ingredients, etc.), an aqueous solution mixed (e.g., emulsified) with other ingredients, and/or any other diluent. The SPI mixture can include an aqueous component, a protein component, and/or other ingredients. The protein component can include protein isolates, SPIs, aggregates of SPIs, a combination thereof, and/or any other proteins. The protein concentration (by weight) in the SPI mixture can be between 0.01%-95% or any range or value therebetween (e.g., 1%-15%, 30%-50%, 44%, 40%, etc.), but can alternatively be less than 0.01% or greater than 95%.

The proteins can include: globulins (e.g., 2S globulins, 1S globulins, 7S globulins, conglutin, napin, sfa, edestin, amandin, concanavalin, vicilin, legumin, cruciferin, helianthinin, etc.), pseudoglobulins, globular proteins, prolamins, albumins, gluten, gliadin, conglycinin, hordein, phaseolin, zein, oleosin, caleosin, steroleosin, conjugated proteins (e.g., lipoprotein, mucoprotein, etc.), other storage proteins (e.g., seed storage proteins, vegetative storage protein, etc.), animal proteins (e.g., casein, insect proteins, etc.), and/or any other suitable protein or combination thereof. Proteins can optionally be modified (e.g., transglutaminase modifications, proteolytic modifications, glycosylation, glycation, phosphorylation, acylation, etc.) pre- or post-extraction from the protein source. The proteins (e.g., modified or unmodified) can optionally include SPIs, wherein protein isolate units (e.g., protein monomers arranged in an oligomeric complex such as a hexamer) can be arranged in: agglomerates, aggregates, micelles, stacks, and/or any other suitable higher-order arrangement (e.g., quaternary structure or higher). The SPI structure can be a sphere (e.g., a shell of protein isolate units, a shell or micelle with hydrophilic regions along the exterior and hydrophobic regions along the interior, etc.), an amorphous structure, and/or any other structure. The proteins can optionally include an aggregate of SPIs, wherein constituent SPIs can be arranged in: agglomerates, aggregates, micelles, stacks, and/or any other suitable higher-order arrangement. The proteins can include casein proteins, non-casein proteins, mammalian proteins, non-mammalian proteins, plant proteins, animal proteins, non-animal proteins, and/or any other proteins.
For example, proteins in target protein sets can include casein proteins, mammalian proteins, and/or animal proteins, while proteins in candidate protein sets can substantially exclude casein proteins, mammalian proteins, allergen proteins (e.g., proteins from allergens, such as peanuts, soy, wheat, etc.), and/or animal proteins, and/or include plant proteins (e.g., exclusively include plant proteins). In a specific example, proteins in candidate protein sets can include casein, mammalian, and/or animal proteins below a threshold amount, wherein the threshold amount can be between 0.1%-10% or any range or value therebetween (e.g., 10%, 5%, 3%, 2%, 1%, 0.1%, etc.), but can alternatively be greater than 10% or less than 0.1%.

The protein set can be associated with a protein set composition and/or a total protein quantity (e.g., wherein the total protein quantity is an overall concentration or amount of proteins within a protein source and/or source component, an overall concentration or amount of proteins within a product, etc.). The protein set composition can include an identification of each protein in the set (e.g., a name or other identifier for each protein) and/or a concentration of each protein in the set. The concentration of a protein in the protein set can be an absolute concentration or a concentration relative to other proteins in the protein set. In examples, the concentration can be a percentage (e.g., by weight, by mass, by moles, etc.), a ratio, a proportion, an abundance, an amount (e.g., weight, mass, moles, etc.), a ranking (e.g., wherein each protein in the set is ranked relative to the other proteins based on concentration), and/or any other concentration metric. In an illustrative example, the composition of a first protein set can include a first protein (P1) at a concentration C1, and a second protein (P2) at a concentration C2; the composition of a second protein set can include the same proteins (P1 and P2) at different concentrations C3 and C4, respectively. The protein set composition and/or the total protein quantity can be measured (e.g., using an assay), predetermined (e.g., manually specified), predicted (e.g., based on evolutionary relationships, using a prediction model, based on an amino acid composition, using a database, etc.), and/or otherwise determined.
In a first specific example, a first protein source is associated with a first protein set with a known composition and a second protein source is associated with a second protein set with an unknown composition, wherein an evolutionary relationship (e.g., based on an evolutionary tree) between the first and second protein sources is used to predict the composition of the second protein set (e.g., using the assumption that certain proteins and/or protein concentrations would be similar between the first and second protein sets when the protein sources are evolutionarily close). In a second specific example, an overall composition of amino acids in a protein set is determined using an assay (e.g., LC/MS), and the composition of amino acids in each constituent protein is predicted based on the amino acid sequence for the respective constituent protein. In a third specific example, an overall composition of amino acids in a protein set and a composition of amino acids in each constituent protein are retrieved from an amino acid composition database (e.g., a third-party PseAAC database). A model (e.g., a regression) can be used to determine the concentration of each constituent protein within the protein set based on the overall amino acid composition (e.g., of the mixture) and the amino acid compositions for the constituent proteins.
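
As an illustrative sketch of the regression mentioned above, the overall amino acid composition can be modeled as a linear mixture of the constituent proteins' amino acid compositions, with the mixing fractions estimated by ordinary least squares. This is an assumed formulation; the function names and all numeric values are hypothetical, and the sketch does not enforce non-negative fractions, which a practical implementation would likely want.

```python
from typing import List


def solve_linear_system(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: protein compositions are not independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - known) / aug[r][r]
    return solution


def estimate_protein_fractions(
    protein_aa: List[List[float]], mixture_aa: List[float]
) -> List[float]:
    """Least-squares fractions x such that sum_j x[j] * protein_aa[j] ~ mixture_aa.

    protein_aa[j][a] is the fraction of amino acid a in protein j.
    Solves the normal equations (A^T A) x = A^T b.
    """
    n = len(protein_aa)
    gram = [
        [sum(p * q for p, q in zip(protein_aa[i], protein_aa[j])) for j in range(n)]
        for i in range(n)
    ]
    rhs = [sum(p * m for p, m in zip(protein_aa[i], mixture_aa)) for i in range(n)]
    return solve_linear_system(gram, rhs)


# Hypothetical compositions over three amino acids for two constituent proteins.
p1 = [0.5, 0.3, 0.2]
p2 = [0.1, 0.4, 0.5]
# Hypothetical measured mixture, constructed here as 70% P1 and 30% P2.
mixture = [0.7 * a + 0.3 * b for a, b in zip(p1, p2)]
fractions = estimate_protein_fractions([p1, p2], mixture)
```

With noisy assay data, a constrained solver (e.g., non-negative least squares) would be a natural substitute for the unconstrained solve used here.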

The protein set can be associated with one or more sequences (e.g., one sequence for each individual protein in the set). Sequences can include amino acid sequences, genetic sequences (e.g., DNA sequence, RNA sequence, gene sequence, etc.), any molecular sequence, any protein sequence, and/or other genetic information. Sequences can be measured (e.g., using an assay), predetermined (e.g., manually specified), predicted (e.g., based on an evolutionary tree, using a prediction model, etc.), and/or otherwise determined.

The protein set can be associated with a context. The context can include: process parameters, protein modifications, sample environment, and/or any other information associated with the protein set and/or a sample (e.g., a food product, a gel, and/or any other product) containing the protein set. The context can be measured (e.g., using an assay), predetermined (e.g., manually specified), predicted, and/or otherwise determined.

The protein set can be associated with one or more protein structures (e.g., one structure for each protein, one structure for each protein-context combination, etc.). The protein structures can be measured, predicted (e.g., using protein structure prediction models), and/or otherwise determined.

Process parameters are preferably specifications prescribing the manufacturing of a sample containing the protein set (e.g., extracting the protein set from one or more protein sources, manufacturing the sample using the protein set, etc.), but can be otherwise defined. Process parameters can define: manufacturing specifications; the amounts thereof (e.g., ratios, volume, concentration, mass, etc.); temporal parameters thereof (e.g., when the input should be applied, duration of input application, etc.); and/or any other suitable manufacturing parameter. Manufacturing specifications can include: ingredients, treatments, and/or any other sample manufacturing input, wherein the process parameters can include parameters for each specification. Examples of ingredients can include: plant matter, proteins, lipids (e.g., fats, oils, etc.; isolated from plant sources; etc.), water, preservatives, acids and/or bases, macronutrients (e.g., protein, fat, starch, sugar, etc.), nutrients, micronutrients, carbohydrates, gums, vitamins, enzymes, emulsifiers, hydrocolloids, salts, chemical crosslinkers and/or non-crosslinkers, coloring, flavoring compounds, vinegar, mold powders, microbial cultures (e.g., cheese cultures, such as Penicillium camemberti, Penicillium candidum, Geotrichum candidum, Penicillium roqueforti, Penicillium nalgiovensis, Verticillium lecanii, Kluyveromyces lactis, Saccharomyces cerevisiae, Candida utilis, Debaryomyces hansenii, Rhodosporidium infirmominiatum, Candida jefer, Corynebacteria, Micrococcus sps., Lactobacillus sps., Lactococcus, Staphylococcus, Halomonas, Brevibacterium, Psychrobacter, Leuconostocaceae, Streptococcus thermophilus, Pediococcus sps., Propionibacteria culture, combinations thereof, etc.), carbon sources, any combination thereof, and/or any other ingredient.
Examples of treatments can include: adjusting temperature, adjusting salt level, adjusting pH level, diluting, pressurizing, depressurizing, humidifying, dehumidifying, agitating, resting, adding ingredients, removing components (e.g., filtering, draining, centrifugation, etc.), adjusting oxygen level, brining, comminuting, fermenting, mixing (e.g., homogenizing), reactions (e.g., acylation, glycation, phosphorylation, etc.), structural adjustments (e.g., micellization, etc.), and/or other treatments. Examples of treatment parameters can include: treatment type, treatment duration, treatment rate (e.g., flow rate, agitation rate, cooling rate, etc.), treatment temperature, time (e.g., when a treatment is applied, when the sample is characterized, etc.), and/or any other parameters.

Protein modifications can include transglutaminase modifications, proteolytic modifications, glycosylation, glycation, phosphorylation, acylation, hydrolysis, and/or any other protein treatments. The modified proteins can be used as ingredients for a downstream product (e.g., dairy replicate), be used as a product (e.g., be sold as-is, be fermented using a cheese culture post-modification, etc.), and/or be otherwise used.

In a first embodiment, proteins (e.g., proteins containing nucleophilic residues, such as Lys, Ser, Thr, Cys, etc.; SPIs; etc.) can be acylated using fatty acyl anhydrides (e.g., caprylic anhydride; myristic acid; stearic acid; oleic acid; linoleic acid; etc.), yielding a fatty acylated protein (e.g., via an amide linkage, such as from Lys; ester linkage, such as from Ser; thioester linkage, such as from Cys; etc.) and a fatty acid. For example, the ratio between proteins and acyl anhydrides (e.g., by weight, by mass, by moles, etc.) can be between 1:1-1:4, but can alternatively be greater than 1:1 or less than 1:4. Unreacted fatty acyl anhydride can be quenched (e.g., with hydroxide and water, a base, a salt, etc.), yielding the corresponding fatty acid. The resultant fatty acylated protein and/or a sample therefrom can have increased lipid binding; increased hydrophobicity; increased gel strength; increased flow at elevated temperatures (i.e., melt); increased stretchiness; and/or other changed functional property values (e.g., values for texture, nutrition, etc.) relative to the unacylated protein or a sample therefrom. In variants, other carboxylic acid conjugation reagents (e.g., acyl chlorides, activated carboxylic acids, metal catalysts, etc.) can additionally or alternatively be used.

In a second embodiment, proteins (e.g., protein residues, surface-accessible nucleophilic residues, etc.; SPIs; etc.) can be phosphorylated (e.g., using sodium trimetaphosphate). For example, nucleophilic residues (e.g., Ser, Thr, Lys) of the protein may attack sodium trimetaphosphate (STMP) and/or other reagents, resulting in a triphosphorylated protein which hydrolyses, releasing pyrophosphate to yield the phosphorylated protein. Examples of other phosphorylation reagents that can be used include: other trimetaphosphate salts; hexametaphosphate salts; tripolyphosphate salts; polyphosphate salts; nucleoside triphosphates; and/or other phosphorylation agents. In variants, phosphorylation can be performed using non-toxic (e.g., at relevant concentrations) catalysts, reagents, byproducts, and/or other substances. The resultant phosphorylated protein and/or a sample therefrom can have increased calcium binding (e.g., an increased calcium concentration in the sample); increased stretchiness; increased flow at elevated temperatures (i.e., melt); increased solubility; increased hydrophobicity and/or hydrophilicity; decreased toxicity; decreased hydrophobicity and/or hydrophilicity; and/or other changed functional property values relative to an unphosphorylated protein and/or a sample therefrom.

In an example, the proteins (e.g., protein isolates, SPIs, dissolved and resuspended protein source substrate, etc.) can be suspended in a protein solution at a target protein concentration. The target protein concentration in the protein solution and/or the target protein concentration in a final mixture (e.g., including protein, acids/bases, phosphorylation reagent, calcium, etc.) is between 3%-50% or any range or value therebetween (e.g., 4-10%, 6%, 9%, greater than 6%, greater than 9%, 10-20%, 15%, etc.), but can alternatively be less than 5% or greater than 50%. In a specific example, the proteins are diluted to achieve the target concentration. The diluent can be water (e.g., deionized water), an aqueous solution (e.g., water, a mixture of water and other ingredients, etc.), and/or any other diluent. The protein solution can optionally be homogenized (e.g., for 30 s-10 min, 1 min, 2 min, 3 min, 5 min, any other time, etc.). The pH of the protein solution can be adjusted to a target pH, wherein the target pH is between 3-12 or any range or value therebetween (e.g., 3-5, 4, 5-7, 6, above 6, above 7, below 7, 10-11, 10, 10.5, 11, etc.), but can alternatively be less than 3 or greater than 12. A solution including a phosphorylation reagent (e.g., Na₃(PO₃)₃) can be added to the protein solution to achieve a target concentration in a final mixture (e.g., 20 mM-1000 mM, 100 mM-500 mM, 300 mM-400 mM, 80 mM, 150 mM, 250 mM, 350 mM, greater than 150 mM, greater than 80 mM, less than 80 mM, etc.). The resulting (intermediate) mixture can optionally be homogenized (e.g., for 30 s-10 min, 1 min, 2 min, 3 min, 5 min, any other time, etc.). The resulting mixture can be stirred for between 15 min-10 hrs or any range or value therebetween (e.g., 30 min-2 hrs, 1 hr, etc.), but can alternatively be stirred for less than 15 min or greater than 10 hrs.
The stir rate can be between 100 rpm-10,000 rpm or any range or value therebetween (e.g., 300 rpm-1,000 rpm), but can alternatively be less than 100 rpm or greater than 10,000 rpm. The temperature while stirring can be between 10° C.-50° C. or any range or value therebetween (e.g., 20° C.-30° C., room temperature, etc.), but can alternatively be less than 10° C. or greater than 50° C. Calcium can optionally be added to the mixture (e.g., to bind calcium to the phosphorylated proteins, to enable the reaction to proceed forward, etc.) before or after phosphorylating agent addition. For example, a solution of calcium salts (e.g., CaCl₂) can be added to the mixture to achieve a target concentration in a final mixture (e.g., 5 mM-40 mM, 20 mM-10000 mM, 20 mM-1000 mM, 80 mM, 100 mM, 140 mM, 240 mM, 300 mM, 400 mM, greater than 140 mM, less than 400 mM, etc.). The process parameters in this example can optionally achieve a sticky (e.g., increased adhesion, decreased hardness, etc.) texture in a sample produced using the phosphorylated proteins. An example is shown in FIG. 14. Additionally or alternatively, the texture of the sample can be hardened by increasing the amount of phosphorylating agent, decreasing the amount of calcium salts, decreasing the amount of protein in the starting protein solution, and/or decreasing the pH.
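
Because the reagent concentrations above are specified for the final mixture, adding a reagent solution also dilutes the protein. The following worked sketch computes the volume of a stock reagent solution needed to reach a target final concentration, and the resulting protein concentration, assuming volumes are additive. The 100 mL batch size and 1000 mM stock concentration are hypothetical; the 9% protein and 150 mM reagent targets are among the example values listed above.

```python
def reagent_volume_to_add(
    solution_volume_ml: float, stock_mm: float, target_mm: float
) -> float:
    """Volume of stock to add so the final mixture reaches target_mm.

    From stock_mm * v = target_mm * (solution_volume_ml + v), assuming the
    starting solution contains none of the reagent and volumes are additive.
    """
    if not 0 < target_mm < stock_mm:
        raise ValueError("target must be positive and below the stock concentration")
    return target_mm * solution_volume_ml / (stock_mm - target_mm)


def diluted_concentration(
    initial_concentration: float, initial_volume_ml: float, added_volume_ml: float
) -> float:
    """Concentration after adding a volume that contains none of the solute."""
    return initial_concentration * initial_volume_ml / (initial_volume_ml + added_volume_ml)


# Hypothetical batch: 100 mL of 9% protein solution, 1000 mM reagent stock.
stock_volume_ml = reagent_volume_to_add(100.0, stock_mm=1000.0, target_mm=150.0)
final_protein_pct = diluted_concentration(9.0, 100.0, stock_volume_ml)
```

Under these assumptions, roughly 17.6 mL of stock is added and the protein concentration in the final mixture drops from 9% to about 7.65%, so a starting solution may need to be prepared above the target protein concentration.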

In examples, the phosphorylated proteins can optionally be collected, such as via centrifugation (e.g., collecting the sediment after centrifugation), filtration, precipitation, and/or other protein isolation methods, wherein the proteins can be used in all or parts of the method. The centrifugation speed can be between 500 rpm-20,000 rpm or any range or value therebetween (e.g., 1,000 rpm-10,000 rpm, 5,000 rpm, etc.), but can alternatively be less than 500 rpm or greater than 20,000 rpm. The centrifugation time can be between 30 s-1 hr or any range or value therebetween (e.g., 5 min-30 min, 10 min, 20 min, etc.), but can alternatively be less than 30 s or greater than 1 hr. The proteins can optionally be resuspended after collection (e.g., washed) in a diluent (e.g., water). The ratio (by volume) between the collected protein and the diluent can be between 1:10-10:1 (e.g., 1:3, 1:2, 1:1, 2:1, 3:1, etc.), but can alternatively be less than 1:10 or greater than 10:1.

However, proteins can be otherwise phosphorylated.

In a third embodiment, proteins (e.g., protein residues, surface-accessible lysine residues, etc.) can be glycated. For example, lysine residues and/or other residues can covalently bond to sugars (e.g., via nucleophilic attack of an acyclic sugar's aldehyde), resulting in a glycated protein. In variants, glycation can be performed using non-toxic (e.g., at relevant concentrations) catalysts, reagents, byproducts, and/or other substances. The resultant glycated protein and/or a sample therefrom can have increased flow at elevated temperatures (e.g., melt); increased solubility; increased hydrophobicity and/or hydrophilicity; decreased toxicity; decreased hydrophobicity and/or hydrophilicity; and/or other changed functional property values relative to an unglycated protein and/or a sample therefrom.

Maillard glycation is conventionally achieved at high temperatures, which may result in protein denaturation and accelerate later-stage reactions, including those resulting in advanced Maillard products (AMPs). AMPs can give rise to off-flavors and/or off-colors in a sample. In variants, the method can include catalyzing an initial glycation event (e.g., via base catalysis and/or acid catalysis), which can reduce or remove the need for high temperatures (e.g., in the initial and/or later stages).

In an example, glycating proteins can include: combining proteins and sugars in a solution; adjusting a pH of the solution; and adjusting a temperature of the solution.

The proteins (e.g., protein isolates, SPIs, dissolved and resuspended protein source substrate, etc.) and sugars can be combined in the solution (e.g., dissolved in a diluent such as water) at a target protein concentration and a target sugar concentration. The target protein concentration (e.g., by weight) can be between 5%-60% or any range or value therebetween (e.g., 10%-40%, 15%-35%, 15%-25%, 25%-35%, etc.), but can alternatively be less than 5% or greater than 60%. The target sugar concentration can be 5%-70% or any range or value therebetween (e.g., 20%-40%, 20%-30%, 30%-40%, etc.), but can alternatively be less than 5% or greater than 70%. Examples of sugars that can be used include: monosaccharides such as pentoses and hexoses (e.g., ribose, arabinose, xylose, glucose, galactose, fructose, etc.); disaccharides; oligosaccharides; polysaccharides; and/or any other sugars. The sugars can be plant-based, synthesized, and/or otherwise obtained. In variants, the sugar used can be selected based on its reactivity. For example, pentoses can be preferred to hexoses, which can be preferred to disaccharides, which can be preferred to oligosaccharides, which can be preferred to polysaccharides. However, the sugars can be otherwise selected.

The pH of the solution during glycation can be adjusted to a target pH. In a first specific example, all or parts of the glycation reaction can be performed at an acidic pH (e.g., an acid-catalyzed reaction). The target pH can be between 2-7 or any range or value therebetween (e.g., 3-6.5, 4-6, less than 6, etc.), but can alternatively be less than 2 or greater than 7. Examples of acid catalysts that can be used to adjust the pH can include: hydrochloric acid, Brønsted acids, Lewis acids, and/or other acids. In a second example, all or parts of the glycation reaction can be performed at a basic pH (e.g., a base-catalyzed reaction). The target pH can be between 7-11.5 or any range or value therebetween (e.g., 8-11, 9-10.5, 9-10, 10-10.5, greater than 8, greater than 9, etc.), but can alternatively be less than 7 or greater than 11.5. Examples of base catalysts that can be used to adjust the pH can include: sodium hydroxide, sodium bicarbonate, potassium bicarbonate, ammonium bicarbonate, and/or other bases. The acids and/or bases are preferably food safe, but can alternatively be not food safe.

The temperature of the solution can be adjusted to a target temperature for a target reaction time (e.g., wherein the temperature is maintained throughout the reaction time, wherein the temperature is adjusted during the reaction time, etc.). The target temperature can be between 10° C.-200° C. or any range or value therebetween (e.g., at or above 45° C., 40° C.-80° C., at or above 50° C., 55° C.-70° C., below 55° C., at room temperature, above room temperature, etc.), but can alternatively be less than 10° C. or greater than 200° C. The target temperature is preferably below the protein's denaturation point, but can alternatively be at or above the denaturation point. The target reaction time can be between 1 hour-1 week or any range or value therebetween (e.g., 5 hrs-10 hrs, 8 hrs, 24 hrs-48 hrs, 12 hrs-24 hrs), but can alternatively be less than 1 hour or greater than 1 week.
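
The glycation conditions described above (protein and sugar concentration, pH, temperature, and reaction time) can be recorded as a single parameter set and compared against the broad ranges stated in the text. The following is a minimal sketch; the class name, field names, and helper functions are hypothetical, and the check is informational only because the description also permits values outside these ranges.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class GlycationConditions:
    protein_pct: float      # target protein concentration, % by weight
    sugar_pct: float        # target sugar concentration, %
    ph: float               # target pH during glycation
    temperature_c: float    # target temperature, degrees Celsius
    reaction_hours: float   # target reaction time, hours


# Broad ranges quoted in the text (1 week = 168 hours; pH spans the
# acid-catalyzed 2-7 and base-catalyzed 7-11.5 examples).
BROAD_RANGES = {
    "protein_pct": (5.0, 60.0),
    "sugar_pct": (5.0, 70.0),
    "ph": (2.0, 11.5),
    "temperature_c": (10.0, 200.0),
    "reaction_hours": (1.0, 168.0),
}


def outside_broad_ranges(conditions: GlycationConditions) -> list:
    """Return the names of parameters that fall outside the quoted broad ranges."""
    values = asdict(conditions)
    return [name for name, (low, high) in BROAD_RANGES.items()
            if not low <= values[name] <= high]


def catalysis_mode(ph: float) -> str:
    """Label the reaction by pH, following the acidic/basic examples in the text."""
    if ph < 7.0:
        return "acid-catalyzed"
    if ph > 7.0:
        return "base-catalyzed"
    return "neutral"


# Hypothetical run assembled from example values listed in the text.
run = GlycationConditions(protein_pct=25.0, sugar_pct=30.0, ph=10.0,
                          temperature_c=60.0, reaction_hours=8.0)
flags = outside_broad_ranges(run)
mode = catalysis_mode(run.ph)
```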

However, the proteins can be otherwise glycated.

In examples, modified proteins and/or a sample therefrom can have changed functional property values. The change in functional property value can be determined relative to a protein source, an unmodified protein, a reaction intermediary, a sample therefrom, and/or relative to any other compound or substance. Examples of changes can include: 5%, 10%, 30%, 50%, 80%, a range therebetween, over 80%, and/or any other increased or decreased proportion. In variants, one or more protein modification process variables can be selected, controlled, adjusted, and/or otherwise manipulated to achieve a target functional property value (e.g., target texture). Examples of variables that can be controlled include: the protein source; protein preprocessing methods (e.g., protein isolation techniques, etc.); protein configuration (e.g., protein isolates, structured or unstructured arrangement of protein isolates, etc.); reagents; protein and/or reagent concentrations; stoichiometric ratio between protein and reagents; reaction scale (i.e., mass of initial protein substrate, volume of solvent); reaction time; reaction temperature; reaction pH; quenching or not quenching; washing (e.g., removal of unreacted reactants and byproducts such as pyrophosphate, unreacted sugars, AMPs, etc.) or not washing; concentration (e.g., presence vs absence of acids, bases, and/or other ingredients); and/or other variables.

The sample environment can include: a composition of the sample (e.g., other macronutrients and their respective concentrations), sample structure information (e.g., sample matrix type; sample porosity; sample phase such as solid, liquid, and/or gaseous; etc.), pH level, temperature (e.g., temperature at which the functional properties for the sample would be measured), pressure, isoelectric point, and/or any other sample parameters.

The protein set can be associated with values for one or more characteristics and/or can be uncharacterized (e.g., lack values for one or more characteristics). Characteristics can include: features, functional properties (e.g., an example of functional properties is shown in FIG. 4), functionalities (e.g., storage functionalities, breaking down sugar and/or any other molecule, enzyme functionalities, etc.), and/or any other characteristics.

Features are preferably sequence features (e.g., extracted from one or more amino acid sequences), but can alternatively be other protein characteristics (e.g., molecular features, physicochemical features, protein structure features, context features, etc.). Features can be human-interpretable (e.g., semantic features, where features represent specific properties, where the influence of a feature on functional properties is understood, etc.) or not human-interpretable (e.g., nonsemantic). Optionally, features can be annotated to provide human-interpretable context (e.g., by using an explainability or interpretability method applied to one or more models, etc.).

A feature set can include: all possible features, a subset of features (e.g., selected using dimensionality reduction, selected using a feature selection model, selected features based on correlation with specific functional properties, etc.), a user-defined set of features, weighted features, aggregated features, and/or any other suitable set of features. The features within the feature set can be: learned (e.g., using an autoencoder, using a deep learning model, etc.), handcrafted, and/or otherwise determined.

Each protein set can be associated with one feature value set (e.g., an aggregate feature value set), multiple feature value sets (e.g., one feature value set for each constituent protein, different feature value sets corresponding to different folding configurations, different feature value sets corresponding to different contexts, etc.), no feature value set, and/or any other feature value set. In an illustrative example, a feature value set is a feature value vector, wherein each element is a feature value for a feature in a feature set (e.g., a feature vector). In a first embodiment, each protein in the protein set is associated with a feature value set, wherein an aggregate feature value set (e.g., a representative feature value set) is determined for the protein set based on the feature values of the constituent proteins using a feature aggregation model (e.g., examples shown in FIG. 5A and FIG. 5B). In a second embodiment, a feature value set for the protein set can be directly determined (e.g., using a feature extraction model, using a machine learning model, etc.).

Feature values can include and/or be extracted (e.g., using a feature extraction model) from: sequences (e.g., amino acid sequences, genetic sequences, etc.), measurements and/or other data, structures (e.g., primary, secondary, or tertiary structures that are known, measured, computer-generated, etc.), context, other feature values, and/or any other information. Examples of features can include: amino acid composition-based features, autocorrelation-based features, profile-based features, pseudo amino acid composition, sequence features (e.g., AA groups, active sites, binding sites, PTM sites, repeats, etc.), domain features, physicochemical features, and/or any other feature. For example, features can include and/or be based on: k-mers; pseudo structure status composition (PseSSC); pseudo amino acid composition (PseAAC); composition, transition, and distribution (CTD); grand average of hydropathicity index (GRAVY); autocovariance; auto-cross covariance; top-n-gram; overall amino acid count; count and/or percentage of a specific amino acid; amino acid structure (e.g., amino acid subsequence organization within the amino acid sequence); charge (e.g., overall charge, charge distribution, charge at a given pH, etc.); acidity; hydrophilicity/hydrophobicity; functional groups; flexibility; instability; aromaticity; length; molecular weight; binding affinity; active sites (e.g., count, structure, location, etc.); physicochemical and/or molecular features of amino acids; and/or any other feature.
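
Several of the sequence features listed above (amino acid composition, GRAVY, and k-mers) can be computed directly from an amino acid sequence. The following is a minimal sketch of handcrafted feature extraction; the function names and the toy sequence are hypothetical, the GRAVY calculation uses the standard Kyte-Doolittle hydropathy scale, and sequences are assumed to contain only the 20 standard one-letter residue codes.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(sequence: str) -> dict:
    """Fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in sorted(KYTE_DOOLITTLE)}


def gravy(sequence: str) -> float:
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def kmer_counts(sequence: str, k: int) -> Counter:
    """Counts of every overlapping subsequence of length k."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def feature_vector(sequence: str) -> list:
    """Fixed-order feature vector: length, GRAVY, then the 20 composition fractions."""
    composition = amino_acid_composition(sequence)
    return [float(len(sequence)), gravy(sequence)] + list(composition.values())


toy_sequence = "MKKLLA"  # hypothetical sequence, for illustration only
vector = feature_vector(toy_sequence)
```

Keeping the composition entries in a fixed (sorted) order means each vector position represents the same feature for every protein, which is what allows vectors from different proteins to be compared or aggregated.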

However, features and/or feature values can be otherwise defined.

Functional properties can include macro functional properties, microfunctional properties, nano functional properties, a combinationthereof, other characteristics, and/or any other functional properties.

The set of functional property values for a protein set functions to define how the protein set and/or proteins in the protein set: behaves during sample preparation or cooking, influences the finished sample (e.g., in look, feel, taste, etc.), interacts with other molecules (e.g., secondary interactions, tertiary interactions, quaternary interactions, etc.), denatures (e.g., the denaturation point), folds, aggregates, performs other target functionalities, and/or exhibits any other behavior at the nano, micro, and/or macro scale (e.g., behaviors between the protein as a whole and the context or other proteins, etc.). Functional properties can include: nutritional profile (e.g., macronutrient profile, micronutrient profile, etc.), texture (e.g., texture profile, firmness, toughness, puncture, stretch, compression response, mouthfeel, viscosity, graininess, relaxation, stickiness, chalkiness, flouriness, astringency, crumbliness, stretchiness, tearability, mouth melt, etc.), solubility, melt profile, smoke profile, gelation point, flavor, appearance (e.g., color, sheen, etc.), aroma, precipitation, stability (e.g., room temperature stability), emulsion stability, ion binding capacity, heat capacity, solid fat content, chemical properties (e.g., pH, affinity, surface charge, isoelectric point, hydrophobicity/hydrophilicity, chain lengths, chemical composition, nitrogen levels, chirality, stereospecific position, etc.), physicochemical properties, compound concentration (e.g., in the solid sample fraction, vial headspace, olfactory bulb, post-gustation, etc.), denaturation point, denaturation behavior, aggregation point, aggregation behavior (e.g., micellization capability, micelle stability, etc.), particle size, structure (e.g., microstructure, macrostructure, fat crystalline structure, etc.), folding state, folding kinetics, interactions with other molecules (e.g., dextrinization, caramelization, coagulation, shortening, interactions between fat and protein, interactions with water, aggregation, micellization, etc.), fat leakage, water holding and/or binding capacity, fat holding and/or binding capacity, fatty acid composition (e.g., percent saturated/unsaturated fats), moisture level, turbidity, interactions within the protein set (e.g., protein aggregation), properties determined using an assay tool, and/or any other properties. In examples, functional properties can include physicochemical and/or biochemical properties of amino acids and/or clusters of amino acids in each protein.

A functional property set can include: all possible functional property values, a subset of functional properties (e.g., selected using dimensionality reduction, selected using a functional property selection model, etc.), a user-defined set of functional properties, weighted functional properties, and/or any other suitable set of functional properties.

Functional property value sets can be associated with an individual protein, the entire set of proteins (e.g., a protein mixture; where each protein in the set is assigned the same functional property values, where functional property values are assigned to each protein based on individual protein concentrations within the set, etc.), a subset of the protein set (e.g., one or more proteins with the highest concentrations within the set), and/or be unassociated with the protein set (e.g., manually defined target functional properties). Each protein set can be associated with one functional property value set (e.g., wherein the functional property value set includes a value for each functional property in a functional property set), multiple functional property value sets (e.g., a protein set can be associated with different functional property value sets corresponding to different contexts), no functional property value set (e.g., uncharacterized), or any other functional property value set. In a specific example, a given protein can be associated with multiple functional property value sets, wherein each functional property value set corresponds to a different protein set that includes the given protein. Functional property values can optionally include an uncertainty parameter (e.g., measurement uncertainty, determined using statistical analysis, etc.).

The functional property values can be determined experimentally (e.g., using an assay tool), determined via computer simulations, predicted (e.g., using a prediction model, based on the sample context, other functional properties, other inputs, etc.), and/or be otherwise determined. The functional property values can be: directly measured, analyzed and/or transformed data, features extracted from data (e.g., a data time series), and/or be otherwise determined.

However, functional properties and/or functional property values can be otherwise defined.

The system can optionally leverage one or more assays. Properties determined using an assay tool can optionally be and/or be used to determine any functional property value and/or feature value. Examples of assays and/or assay tools that can be used include: a differential scanning calorimeter (e.g., to determine properties related to melt, gelation point, denaturation point, etc.), Schreiber test, an oven (e.g., for the Schreiber test), a water bath, a texture analyzer, a rheometer, spectrophotometer (e.g., to determine properties related to color), centrifuge (e.g., to determine properties related to water binding capacity), moisture analyzer (e.g., to determine properties related to water availability), light microscope (e.g., to determine properties related to microstructure), atomic force microscope (e.g., to determine properties related to microstructure), confocal microscope (e.g., to determine protein association with fat/water), staining (e.g., paired with computer vision models), laser diffraction particle size analyzer (e.g., to determine properties related to emulsion stability), polyacrylamide gel electrophoresis system (e.g., to determine properties related to protein composition), phos-tag acrylamide gel electrophoresis (e.g., to determine extent of phosphorylation), acrylamide gel electrophoresis (e.g., to determine extent of glycation), mass spectrometry (MS), gas chromatography (GC) (e.g., gas chromatography-olfactometry, GC-MS, etc.; to determine properties related to aroma/flavor, to determine properties related to protein composition, etc.), liquid chromatography (LC), LC-MS, fast protein LC (e.g., to determine properties related to protein composition), protein concentration assay systems, thermal gravimetric analysis system, thermal shift (e.g., to determine protein denaturation and/or aggregation behavior), ion chromatography, dynamic light scattering system (e.g., to determine properties related to particle size, to determine protein aggregation, etc.), Zetasizer (e.g., to determine properties related to surface charge), protein concentration assays (e.g., Qubit, Bradford, Biuret, LECO, etc.), particle size analyzer, sensory panels (e.g., to determine properties related to texture, flavor, appearance, aroma, etc.), capillary electrophoresis SDS (e.g., to determine protein concentration), spectroscopy (e.g., fluorescence spectroscopy, circular dichroism, etc.; to determine folding state, folding kinetics, denaturation temperature, etc.), absorbance spectroscopy (e.g., to determine protein hydrophobicity), CE-IEF (e.g., to determine protein isoelectric point/charge), total protein quantification, high temperature gelation, microbial cloning, Turbiscan, stereospecific analysis, and/or any other assay and/or assay tool. In an illustrative example, a sample made using the protein set can be stained (e.g., for lipids and proteins), imaged, and analyzed (e.g., using the image) to determine the sample's lipid and protein structure (e.g., treated as a functional property). The sample can optionally be measured using GC-MS to determine the chemical composition of the sample.

The method can be used with one or more targets, wherein one or more candidate protein sets (e.g., analogous protein sets) can be determined based on the target (e.g., to replace a target protein set, to manufacture an analog for a target product, to identify a protein set with target characteristic values, etc.). Candidate protein sets can include: proteins found in a predetermined set of protein sources, proteins expressed by a predetermined set of species, genus, family, and/or other set of organisms, and/or other proteins. For example, candidate protein sets can include proteins found in plant-based sources (e.g., substantially excluding animal-based sources), naturally-occurring sources, genetically modified sources, synthetic sources, and/or any other suitable source. Target protein sets can include: a protein set to be replaced or replicated, or any other protein set. For example, target protein sets can include proteins found in animal-based sources (e.g., dairy sources).

The target can include one or more: target characteristics (e.g., features, functional properties, etc.), target characteristic values, target protein sets (e.g., a single protein set, a composition of protein sets, etc.), target sources, target products (e.g., target food products), and/or other targets. Examples of target food products include: dairy fats (e.g., ghee, other bovine milk fats, etc.), milk (e.g., cow milk, sheep milk, goat milk, human milk, etc.), cheese (e.g., hard cheese, soft cheese, semi-hard cheese, semi-soft cheese), yogurt, cream cheese, dried milk powder, cream, whipped cream, ice cream, coffee cream, other dairy products, egg products (e.g., scrambled eggs), additive ingredients, mammalian meat products (e.g., ground meat, steaks, chops, bones, deli meats, sausages, etc.), fish meat products (e.g., fish steaks, filets, etc.), any animal product, and/or any other suitable food product. In specific examples, the target food product includes mozzarella, burrata, feta, brie, ricotta, camembert, chevre, cottage cheese, cheddar, parmigiano, pecorino, gruyere, edam, gouda, jarlsberg, and/or any other cheese.

Target characteristic values can optionally be characteristic values for a target product and/or for a target protein set (e.g., associated with a target product). Target characteristic values can include a single value and/or ranges. A target can be: a single target (e.g., a single target characteristic value set for a given protein set) or aggregated targets (e.g., a vectorized set of feature values and/or functional property values aggregated across multiple protein sets, etc.). A target can be: a positive target (e.g., where positive target features are positively correlated with target functional properties; where desired characteristics are positive targets; etc.), or a negative target (e.g., where negative target features are negatively correlated with target functional properties; where undesired characteristics are negative targets; etc.); an example is shown in FIG. 10. In a first variant, the target characteristic values include desired feature values; an example is shown in FIG. 11. In a second variant, the target characteristic values include desired functional property values (e.g., associated with a target protein set, manually specified, etc.); examples are shown in FIG. 10 and FIG. 12.

However, the target can be otherwise defined.

The system can include one or more models, including feature extraction models, correlation models, feature selection models, functional property selection models, prediction models, protein set determination models, feature aggregation models, similarity models, structure prediction models, and/or any other model. Any model can include: regression, classification, neural networks (e.g., CNNs, DNNs, etc.), rules, heuristics, equations (e.g., weighted equations, etc.), selection (e.g., from a library), instance-based methods (e.g., nearest neighbor), regularization methods (e.g., ridge regression), decision trees, models used in Bayesian methods (e.g., Naïve Bayes, Markov), optimization methods, kernel methods, probability, deterministics, genetic programs, support vectors, and/or any other suitable method.

The models can include classical machine learning models (e.g., linear regression, logistic regression, decision tree, SVM, nearest neighbor, PCA, SVC, LDA, LSA, t-SNE, naïve Bayes, k-means clustering, clustering, association rules, dimensionality reduction, etc.), neural networks (e.g., CNN, CAN, LSTM, RNN, autoencoders, deep learning models, etc.), ensemble methods, heuristics, and/or any other suitable model. The models can be scoring models, numerical value predictors (e.g., regressions), classifiers (e.g., binary classifiers, multiclass classifiers, etc.), and/or provide other outputs.

The models can be trained and/or learned, fit, predetermined, and/or can be otherwise determined. The models can be learned using: supervised learning, unsupervised learning, reinforcement learning, Bayesian optimization, positive-unlabeled learning, and/or otherwise learned. In specific examples, models can be trained using multiple-instance learning (MIL), learning to aggregate (LTA), and/or any other training approach. The models can be learned or trained on: labeled data (e.g., data labeled with the target label), unlabeled data, positive training sets (e.g., a set of data with true positive labels), negative training sets (e.g., a set of data with true negative labels), and/or any other suitable set of data.

The models can be specific to: functional properties, a protein set, a context, a target, and/or otherwise specific, or be generic. The feature extraction model can function to extract values for features for a protein set (e.g., for each protein in the set, for the protein set as a whole, etc.). The feature extraction model can output feature values based on molecular information inputs (e.g., sequences, measurements, data, structure, protein set composition, etc.), context, and/or other information. The feature extraction model can use: folding analysis, classifiers, the reduced alphabet approach, Markov models, statistical methods, n-gram analysis, autocovariance, auto-cross covariance, protein descriptor methods (e.g., PseSSC, PseAAC, CTD, GRAVY, etc.), any protein analysis methods, encoders (e.g., trained to encode the sequence to a shared latent space), and/or any other feature extraction technique. In a first example, the feature extraction model extracts handpicked features (e.g., wherein the feature extraction model is trained on a predetermined training value for the feature). In a second example, the feature extraction model can be adopted from another domain (e.g., be a linguistic feature model). In a third example, the feature extraction model can be a subset of the layers from a model trained end-to-end to predict another attribute (e.g., wherein the features can be learned features). In an illustrative example, the feature extraction model can be a subset of layers (e.g., the first several layers, feature extraction layers, intermediary layers, etc.) of a prediction model trained to predict functional property values from protein sequences, context, and/or other inputs (e.g., example shown in FIG. 14). However, the feature extraction model can be otherwise configured.

The extracted features for the protein set can be represented as one or more feature vectors, wherein each vector position can represent a different feature. In a first variant, a feature vector is determined for each protein within the set, wherein the feature value is determined based on the protein's sequence and optionally the protein's abundance or concentration within the protein set. Alternatively, the protein's abundance or concentration can be represented by a separate vector. In a second variant, a feature vector is determined for each protein set, wherein each feature's value is representative of the feature value for the protein set as a whole. In an example, the protein set feature vector is determined based on the feature's values for each protein in the protein set (e.g., wherein the different values for a given functional feature are aggregated, predicted, etc.), and optionally determined based on the respective protein's abundance within the protein set (e.g., weighted based on the respective protein's abundance within the set, etc.). However, the extracted features can be otherwise represented.

The optional feature aggregation model can function to aggregate feature values across proteins in a protein set. The feature aggregation model inputs can include: a feature value set (e.g., a feature value vector) for each protein in the protein set, a feature value set for each protein in a subset of the protein set, a protein set composition, context, and/or any other protein set information. The feature aggregation model outputs can include an aggregate feature value set (e.g., an aggregate feature value vector) for the protein set. The feature aggregation model can optionally interface with and/or be part of the prediction model (e.g., wherein the prediction model aggregates feature values).

The feature aggregation model can leverage classical or traditional approaches (e.g., heuristics, equations, etc.), leverage machine learning approaches (e.g., have learned parameters/weights, use MIL (multiple instance learning), use LTA learning, etc.), and/or be otherwise constructed. In a first embodiment, the feature aggregation model is a traditional or classical model. For example, the feature aggregation model can include a weighted combination (e.g., weighted average, etc.) of the feature value sets for individual proteins in the protein set, wherein the weights can be based on protein type, protein set composition (e.g., protein concentration, protein abundance in the protein set, etc.), and/or any other protein information. In a second embodiment, the feature aggregation model is a neural network. In a first example, the feature aggregation model includes a weighted combination of feature value sets for individual proteins in the protein set with optional interaction terms, wherein the weights and/or the interaction terms are learned parameters. In a second example, the feature aggregation model is the prediction model trained using MIL, wherein each instance is an individual protein with a respective concentration, each bag is a protein set, and bag labels are functional property values.
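
The classical weighted-average aggregation described in the first embodiment can be sketched as follows, with each protein's abundance in the protein set used as its weight. The function name and the example numbers are hypothetical; learned weights or interaction terms (the second embodiment) are not shown.

```python
def aggregate_features(feature_vectors, abundances):
    """Abundance-weighted average of per-protein feature vectors.

    feature_vectors: one equal-length list of feature values per protein.
    abundances: one non-negative weight per protein (need not sum to 1).
    """
    if not feature_vectors or len(feature_vectors) != len(abundances):
        raise ValueError("need one abundance per feature vector")
    total = float(sum(abundances))
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    n_features = len(feature_vectors[0])
    if any(len(v) != n_features for v in feature_vectors):
        raise ValueError("all feature vectors must have the same length")
    return [
        sum(w * v[i] for w, v in zip(abundances, feature_vectors)) / total
        for i in range(n_features)
    ]


# Hypothetical two-protein set: protein 1 is three times as abundant as protein 2.
per_protein = [[1.0, 0.0], [3.0, 2.0]]
aggregate = aggregate_features(per_protein, abundances=[3.0, 1.0])
```

Dividing by the total abundance means the weights can be given as concentrations, mass fractions, or raw counts without prior normalization.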

However, the feature aggregation model can be otherwise configured.

The prediction model can function to predict functional property values for a protein set. The prediction model can incorporate a correlation model, feature selection model, functional property selection model, feature aggregation model, and/or any other model. The prediction model inputs can include: a feature value set for each protein in the protein set (e.g., a feature value vector), a feature value set for the protein set (e.g., a feature value vector for the protein set as a whole, an aggregate feature value vector, etc.), protein set composition, context (e.g., parametrized into a context vector), correlation information (e.g., outputs from the correlation model), and/or any other protein set information. The prediction model outputs can include: a functional property value set and/or any other protein set information. The prediction model can include a single model and/or multiple models. When the prediction model includes multiple models, the models can be arranged in series, in parallel, as distinct models, and/or otherwise arranged. When the prediction model includes multiple models, the models can be trained separately (e.g., using distinct training data sets), trained together (e.g., using the same training data set, using different subsets of the same training data set, etc.), and/or otherwise trained.

In a first variant, the prediction model outputs functional property values based on feature values associated with the protein set (e.g., feature values for individual proteins in the protein set and/or for the protein set as a whole). An example is shown in FIG. 8. The model can optionally predict the functional property value based on the context; an example is shown in FIG. 13. For example, the context can be parametrized into a context vector, wherein the context vector can be appended to the protein set feature vector or provided as another input into the model. The model can predict a value for a single functional property (e.g., be a regression, classifier trained on a single functional property, etc.), values for multiple functional properties (e.g., be a multiclass classifier), and/or values for any other suitable set of functional properties.
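
Appending a context vector to the protein set feature vector, as described in the first variant, can be sketched with a simple instance-based predictor. A k-nearest-neighbor regression is used here only as a stand-in for whichever prediction model is chosen; the function names, the training values, and the single "firmness" label are hypothetical, and the inputs are assumed to be on comparable scales.

```python
import math


def build_model_input(feature_vector, context_vector):
    """Append the context vector to the protein set feature vector."""
    return list(feature_vector) + list(context_vector)


def knn_predict(train_inputs, train_values, query, k=2):
    """Predict a functional property value as the mean of the k nearest training values."""
    order = sorted(range(len(train_inputs)),
                   key=lambda i: math.dist(train_inputs[i], query))
    nearest = order[:k]
    return sum(train_values[i] for i in nearest) / len(nearest)


# Hypothetical training data: [feature_1, feature_2] + [context pH (scaled)].
train_inputs = [
    build_model_input([0.0, 0.0], [0.0]),
    build_model_input([1.0, 1.0], [1.0]),
    build_model_input([10.0, 10.0], [10.0]),
]
train_firmness = [1.0, 3.0, 100.0]  # hypothetical measured values

query = build_model_input([0.4, 0.4], [0.4])
predicted_firmness = knn_predict(train_inputs, train_firmness, query, k=2)
```

Providing the context as a separate model input, rather than concatenating it, is the alternative the text mentions and would require a model that accepts multiple inputs.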

In a second variant, the prediction model predicts functional property values based on protein sequences for the protein set. In an example, the prediction model can output a vector, wherein each vector position can represent a different functional property and the vector value can represent the predicted value for said functional property.

In a third variant, the prediction model predicts a functional property similarity score, indicative of the protein set's functional property similarity to a target sample's functional property, wherein the model can be analyzed (e.g., using an acquisition function) to determine which protein set (and/or feature vector) can produce a sample with functional properties that are closer to the target sample (e.g., using a Bayesian optimization technique).
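In an illustrative, non-limiting sketch of the acquisition step, expected improvement (one commonly used acquisition function in Bayesian optimization) can be evaluated for each candidate protein set. The sketch assumes that a surrogate model supplies a predicted mean and standard deviation of the similarity score for each candidate and that a higher score indicates a closer match; the function name, candidate names, and numeric values are hypothetical and are not taken from this specification.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected improvement of a candidate over the best score observed so far.

    mean, std: surrogate model's predicted similarity score and its uncertainty
    (hypothetical inputs). A higher similarity score is assumed to be better.
    xi: optional exploration margin.
    """
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return improvement * cdf + std * pdf


# Hypothetical surrogate predictions (mean, std) for three candidate protein sets.
surrogate = {
    "set_a": (0.70, 0.05),
    "set_b": (0.65, 0.20),
    "set_c": (0.50, 0.01),
}
best_observed = 0.68

scores = {
    name: expected_improvement(mean, std, best_observed)
    for name, (mean, std) in surrogate.items()
}
# The candidate with the highest expected improvement is evaluated next.
next_candidate = max(scores, key=scores.get)
```

In this sketch a candidate with a slightly lower predicted mean but a larger uncertainty can still be chosen for evaluation, which is the exploration behavior an acquisition function is intended to provide.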

In a fourth variant, the prediction model predicts the protein set that can produce the target functional property values, target feature values, and/or other target. The prediction model (and/or another model) can optionally predict the context (e.g., process parameters) needed to produce the target functional property values. The prediction model can predict: which proteins should be included in the protein set, the amount of each protein in the protein set, and/or other aspects of the protein set. In an example, the prediction model predicts a vector, wherein each vector position represents a different protein, and each value represents an amount of the respective protein. In a second example, the prediction model predicts a protein inclusion vector (e.g., which proteins should be in the set) and a protein amount vector (e.g., how much of the included proteins should be in the set). The two vectors can be predicted serially (e.g., protein inclusion vector first, then protein amount vector), at the same time, by the same model, by different models, and/or otherwise predicted.
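As a minimal sketch of how the two vectors in the second example might be combined, the inclusion vector can mask the amount vector before the amounts are normalized into a composition. The normalization to fractions, the function name, and the protein names are assumptions made for illustration only.

```python
def compose_protein_set(proteins, inclusion, amounts):
    """Combine an inclusion vector and an amount vector into a normalized composition.

    proteins:  protein identifiers, one per vector position
    inclusion: 1/0 (or True/False) flags stating which proteins are in the set
    amounts:   predicted amount for each protein (any consistent unit)
    """
    if not (len(proteins) == len(inclusion) == len(amounts)):
        raise ValueError("all vectors must have one entry per protein")
    # Zero out the amount of every protein that is not included.
    masked = [amt if inc else 0.0 for inc, amt in zip(inclusion, amounts)]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("no protein is included with a positive amount")
    # Normalizing to fractions is an assumption of this sketch.
    return {p: m / total for p, m in zip(proteins, masked) if m > 0.0}


# Hypothetical model outputs for three proteins.
proteins = ["protein_1", "protein_2", "protein_3"]
inclusion = [1, 0, 1]
amounts = [3.0, 5.0, 1.0]
composition = compose_protein_set(proteins, inclusion, amounts)
```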

However, the prediction model can be otherwise configured.

The optional protein set determination model (e.g., selection model) can function to determine a candidate protein set with characteristic values that closely match target characteristic values (e.g., the best/closest match, a match below a threshold, etc.). The protein set determination model inputs can include target characteristic values (e.g., target functional property values, target feature values, etc.), constraints (e.g., context constraints), the database, predicted characteristic values (e.g., predicted functional property values for each of a set of candidate protein sets), and/or any other information. The protein set determination model outputs can include: the candidate protein set (e.g., a candidate protein set selected from the database), the composition of the candidate protein set (e.g., the concentration for each protein in the set), the context for the candidate protein set, an ingredient (e.g., from which the candidate protein set can be derived; for use in product manufacture or target analog manufacture; etc.), and/or any other protein set information. The protein set determination model can use: comparison methods (e.g., matching, distance metrics, etc.), thresholds, optimization methods, regression, selection methods, classification, neural networks (e.g., CNNs, DNNs, etc.), clustering methods, rules, heuristics, equations (e.g., weighted equations, etc.), and/or any other methods. For example, the protein set determination model can search the database for a candidate protein set and/or determine a new protein set based on the target characteristics. The protein set determination model can optionally interface with and/or be part of the prediction model, the similarity model, and/or any other model. In a specific example, the protein set determination model can interface with and/or include the prediction model, wherein functional property values are predicted for each of a set of protein sets (e.g., uncharacterized protein sets) using the prediction model. The protein set determination model can then select a candidate protein set based on a comparison between the predicted functional property values and target functional property values (e.g., using the similarity model). In another example, the protein set determination model can determine the target feature values for a target protein set and identify a candidate protein set based on a comparison (e.g., the similarity) between the respective feature values (for the candidate protein set) and the target feature values, and/or a comparison (e.g., dissimilarity) between the respective feature values and a set of negative target feature values (e.g., feature values from protein sets to avoid).
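The specific example above (predict functional property values for each protein set, then select by comparison with the target) can be sketched as follows. This is a minimal, non-limiting illustration: the Euclidean distance, the `toy_predict` stand-in for a trained prediction model, and all names and values are assumptions and not the claimed implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length value vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_candidate(candidates, predict, target, max_distance=None):
    """Return (name, distance) for the protein set whose predicted functional
    property values are closest to the target values.

    candidates:   mapping of protein set name -> feature value vector
    predict:      callable mapping a feature vector to predicted property values
    max_distance: optional threshold; if the best match is farther, name is None
    """
    best_name, best_dist = None, math.inf
    for name, features in candidates.items():
        dist = euclidean(predict(features), target)
        if dist < best_dist:
            best_name, best_dist = name, dist
    if max_distance is not None and best_dist > max_distance:
        return None, best_dist
    return best_name, best_dist


def toy_predict(features):
    """Hypothetical stand-in for a trained prediction model."""
    return [2.0 * features[0], features[1] + 1.0]


# Hypothetical feature vectors for three uncharacterized protein sets.
candidates = {"set_a": [1.0, 0.0], "set_b": [2.0, 1.0], "set_c": [0.5, 3.0]}
target = [4.0, 2.0]
chosen, distance = select_candidate(candidates, toy_predict, target)
```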

However, the protein set determination model can be otherwise configured.

The optional correlation model can function to determine the correlation, interaction, and/or any other association between features and functional properties. For example, a correlation model can determine correlations between features and functional properties. However, the correlation model can determine correlations between any first set of features and/or functional properties and any second set of features and/or functional properties.

The correlation model inputs can include features (e.g., specifying a subset of features for correlation), feature values (e.g., individual protein feature values and/or aggregate feature values), sequences, functional properties (e.g., specifying a subset of functional properties for correlation), functional property values (e.g., where the feature values and/or functional property values are associated via common protein sets in the database), context, protein set compositions, the database, and/or any other information. The correlation model outputs can include a mapping between features (e.g., features, feature values, ranges of values, etc.) and functional properties (e.g., functional properties, functional property values, ranges of values, etc.), wherein the mapping can include: correlation coefficients (e.g., negative and/or positive), interaction effects (e.g., negative and/or positive, where a positive interaction effect can represent an increased significance effect of feature A on a functional property when in the presence of feature B), an association, and/or other correlation metric. The correlation model can use: classifiers, SVMs, ANNs, RF, conditional random field (CRF), K-nearest neighbors, statistical methods, and/or any other method.

In variants, the mapping between features and functional properties can be an association between features and functional properties (e.g., an autocorrelation feature is correlated with stretchability), feature values and/or ranges thereof with functional properties (e.g., a first range of autocorrelation values is correlated with stretchability, while a second range of autocorrelation values is correlated with spreadability, etc.), features with functional property values and/or ranges thereof, feature values and/or ranges thereof with functional property values and/or ranges thereof (e.g., autocorrelation values are correlated with spreadability values), combinations of features with combinations of functional properties (e.g., including interaction effects between features), combinations of feature values with combinations of functional property values, and/or any other association.

The correlation model can optionally be trained on a set of characterized protein sets (e.g., characterized with feature values, functional property values, etc.). In variants, the correlation model can identify similar and/or divergent feature values (e.g., calculating an implicit and/or explicit similarity measure) between protein sets and correlate those features to functional properties. For example, features with differing values (e.g., across protein sets) can be mapped to the functional properties with differing values (e.g., across the same protein sets). In a first specific example, a first feature is mapped to meltability when the feature values for two protein sets are substantially similar (e.g., within a threshold) except for the first feature's values, and the functional property values for the two protein sets are substantially similar except for the meltability values. In a second specific example, feature value differences (e.g., sequence differences determined using a sequence alignment method, a classifier, etc.) between related proteins (e.g., where a relation is determined using an evolutionary tree) can be correlated with differences in the respective functional property values.

However, the correlation model can be otherwise configured.

The optional feature selection model can function to select a subset of features (e.g., to reduce feature dimensions, to select features likely influencing functional properties, etc.). The feature selection model inputs can include: features, feature values, functional properties, functional property values, target characteristic values, correlation information (e.g., outputs from the correlation model, correlation coefficients, interaction effects, etc.), the database, and/or any other protein set information. The feature selection model outputs can include: a feature subset, target features (e.g., positive and/or negative targets), and/or any other features. The feature selection model can use: supervised selection (e.g., wrapper, filter, intrinsic, etc.), unsupervised selection, recursive feature selection, lift analysis (e.g., based on a feature's lift), any explainability and/or interpretability method (e.g., SHAP values), and/or any other selection method. The feature selection model can be a correlation model (and/or vice versa), can include a correlation model (and/or vice versa), can take correlation model outputs as inputs (and/or vice versa), be otherwise related to a correlation model, and/or be unrelated to a correlation model.

The feature selection model can optionally be trained to select relevant features for functional property value prediction. For example, the training target can be a subset of features with high (positive and/or negative) interaction effects and/or correlation with functional properties (e.g., a correlation coefficient for a feature and/or feature set given a target functional property, interaction coefficients for features, whether an expected correlation and/or interaction was validated and/or invalidated in S600, etc.). However, the feature selection model can be otherwise trained.

However, the feature selection model can be otherwise configured.

The optional functional property selection model can function to select a subset of functional properties (e.g., to reduce dimensions, etc.). The functional property selection model inputs can include: functional properties, functional property values, target characteristic values, correlation information (e.g., outputs from the correlation model, correlation coefficients, interaction effects, etc.), the database, and/or any other protein set information. The functional property selection model outputs can include: a functional property subset, target functional properties (e.g., positive and/or negative targets), and/or any other functional properties. The functional property selection model can use: supervised selection (e.g., wrapper, filter, intrinsic, etc.), unsupervised selection, recursive feature selection, lift analysis (e.g., based on a functional property's lift), any explainability and/or interpretability method (e.g., SHAP values), and/or any other selection method. The functional property selection model can be a correlation model (and/or vice versa), can include a correlation model (and/or vice versa), can take correlation model outputs as inputs (and/or vice versa), be otherwise related to a correlation model, and/or be unrelated to a correlation model.

However, the functional property selection model can be otherwise configured.

The optional similarity model can function to compare two sets of characteristic values. The similarity model inputs can include candidate protein set characteristic values, target characteristic values, and/or any other information. The similarity model outputs can include a comparison metric. The similarity model can use: comparison methods (e.g., matching, distance metrics, etc.), thresholds, optimization methods, regression, selection methods, classification, neural networks (e.g., CNNs, DNNs, etc.), clustering methods, rules, heuristics, equations (e.g., weighted equations, etc.), and/or any other methods. The comparison metric can be qualitative, quantitative, relative, discrete, continuous, a classification, numeric, binary, and/or be otherwise characterized. The comparison metric can be or include a distance, difference (e.g., vector of differences between values for each characteristic, vector of squared differences between values for each characteristic), ratio, regression, residuals, clustering metric (e.g., wherein multiple samples of the candidate and/or target protein sets are evaluated, wherein multiple candidate and/or target protein sets are evaluated, etc.), a statistical measure, and/or any other comparison measure. In an example, the comparison metric is a distance in feature space (e.g., wherein a characteristic value set is an embedding in the feature space). In a specific example, the comparison metric is low (e.g., the candidate protein set is similar to the target product/protein set) when the candidate protein set characteristic values are near (in feature space) positive target characteristic values and/or far from negative target characteristic values. However, the similarity model can be otherwise configured.
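One possible form of the comparison metric in the specific example above is sketched below: the distance to the nearest positive target minus a weighted distance to the nearest negative target, so that a lower value indicates a better match. The subtraction form, the `negative_weight` parameter, and the example values are assumptions of this sketch; any other comparison measure listed above could be substituted.

```python
import math


def distance(a, b):
    """Euclidean distance between two characteristic value vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def comparison_metric(candidate, positive_targets, negative_targets=(), negative_weight=1.0):
    """Lower is better: near a positive target and far from negative targets."""
    nearest_positive = min(distance(candidate, t) for t in positive_targets)
    if not negative_targets:
        return nearest_positive
    nearest_negative = min(distance(candidate, t) for t in negative_targets)
    return nearest_positive - negative_weight * nearest_negative


# Hypothetical characteristic value vectors (embeddings in feature space).
positive_targets = [[1.0, 2.0]]
negative_targets = [[4.0, 5.0]]

metric_near = comparison_metric([1.0, 1.0], positive_targets, negative_targets)
metric_far = comparison_metric([4.0, 4.0], positive_targets, negative_targets)
```

Here `metric_near` is lower than `metric_far` because the first candidate lies close to the positive target and far from the negative target.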

The optional structure prediction model functions to predict the protein folding structure, given the context. The resultant structure can be parametrized and used to determine the protein set feature values, used to determine the functional property values, or otherwise used. Examples of structure prediction models that can be used include: AlphaFold, I-TASSER, HHpred, and/or any other suitable protein structure prediction model.

However, the models can be otherwise defined.

The system can optionally include an evolutionary tree (e.g., representing evolutionary relationships or distances between protein sources, protein sets, etc.). The evolutionary tree and/or evolutionary distances based on the evolutionary tree can be predetermined (e.g., where the evolutionary tree is stored in the system database and/or a third-party database), be retrieved (e.g., for each source in the database), and/or be otherwise determined. The evolutionary tree can be used to identify features, facilitate protein and/or protein set selection, discover a protein source component for a given protein set, and/or be otherwise used. In an example, the evolutionary tree can be traversed to identify candidate protein sources and/or protein source components (e.g., source components that are more commercially feasible) that might have similar protein sets to a given protein source.

5. Method

As shown in FIG. 1, the method can include: characterizing a protein set S100, training a prediction model S300, determining target characteristic values S400, determining a candidate protein set based on the target characteristic values S500, and/or any other suitable steps. The method can optionally include selecting a feature subset S200, selecting a functional property subset S250, evaluating the candidate protein set S600, and/or any other suitable steps.

The method can be performed once (e.g., for a given target), iteratively (e.g., to train one or more models, to iteratively improve determination of a candidate protein set, etc.), concurrently with data generation (e.g., where a database of characterized and/or uncharacterized sources is iteratively updated while one or more protein set determination events are occurring), and/or at any other suitable frequency. All or portions of the method can be performed in real time (e.g., responsive to a request), iteratively, asynchronously, periodically, and/or at any other suitable time. All or portions of the method can be performed automatically, manually, semi-automatically, and/or otherwise performed. All or portions of the method can be performed during training and/or inference (e.g., prediction).

All or portions of the method can be performed by one or more components of the system, by a user, by a computing system, and/or by any other suitable system. The computing system can include one or more: CPUs, GPUs, custom FPGAs/ASICs, microprocessors, servers, cloud computing, and/or any other suitable components. The computing system can be local, remote, distributed, or otherwise arranged relative to any other system or module.

Characterizing a protein set S100 functions to determine abstracted characterizations (e.g., feature values, functional property values, etc.) of the protein set, wherein the characterizations can be used to train the prediction model and/or any other model (e.g., to generate training data), to determine correlations between features and functional properties, to expand the database, and/or for any other downstream functionality. S100 can be performed before S400 and/or at any other time.

In a first variant, the protein set is characterized as a whole (e.g., where characteristic values are determined for and/or associated with the protein set as a unit). In a second variant, the protein set is characterized based on the characteristic values (e.g., feature values, functional property values, etc.) of the constituent proteins. In a first example, each constituent protein is individually characterized, and the mixture characterization is determined based on the individual characterizations (e.g., a set including the individual characteristic values, aggregated individual characteristic values, characteristic values weighted based on the concentration of the constituents in the protein mixture, characteristic values weighted based on a relative importance for a constituent protein in influencing functional properties, etc.). In a specific example, the protein set characterization can be determined using a model (e.g., the feature aggregation model, a machine learning model, etc.) that determines protein set characteristic values based on the individual characterizations of the constituent proteins. In a second example, a subset of proteins in the mixture are assigned characteristic values (e.g., only the highest concentrated protein(s) are assigned feature values and/or functional property values, proteins having a concentration percent value higher than a threshold, etc.).

Characterizing a protein set can include: optionally determining a composition of the protein set (e.g., S120), determining sequences for the protein set (e.g., S140), determining feature values for the protein set (e.g., S160), determining functional property values for the protein set (e.g., S180), and/or optionally determining a functionality (e.g., impact on functional properties, interaction with other molecules, structural functions, etc.) of the protein set (e.g., using machine learning annotation, using a correlation model, using explainability and/or interpretability methods, etc.). In variants, S120, S140, S160, and S180 are performed for training protein sets, while only S120, S140, and S160 are performed for candidate protein sets. However, S100 can be otherwise performed.

Characterizing the protein set can optionally include manufacturing a sample using the protein set (e.g., wherein the manufacturing process is defined based on a context associated with the protein set), wherein all or parts of S100 are performed for the sample. The sample can optionally be processed prior to, during, or after performing any assay (e.g., using dilution, centrifugation, dehydration, lyophilization, reconstitution, concentration methods, etc.).

Determining a composition of the protein set S120 functions to identify each protein and/or the concentration of each protein in the set (e.g., a concentration for each protein within the protein set, a concentration of each protein within a sample containing the protein set, etc.). In a first variant, the composition can be manually or automatically specified (e.g., for a candidate protein set). In a second variant, the composition of the protein set can be measured (e.g., using mass spectrometry proteomics, a Bradford assay, capillary electrophoresis SDS, and/or any other assay). For example, a sample can be manufactured using the protein set, wherein the protein set composition in the sample is measured using one or more assays and/or assay tools. In a specific example, a total protein quantification and individual protein abundances can be measured for the sample, wherein the concentration for each protein in the sample is based on the total protein quantification and individual protein abundances. In a third variant, the composition can be inferred using bioinformatics (e.g., machine learning techniques applied to codons), genomics, transcriptomics, and/or other protein expression prediction techniques. However, the composition can be otherwise determined.
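As a hedged illustration of the specific example above, the per-protein concentration can be obtained by scaling relative abundances to the measured total protein quantification. The sketch assumes that the abundances are proportional to mass and that the total is reported in mg/mL; the function name, units, and values are hypothetical.

```python
def protein_concentrations(total_protein, abundances):
    """Scale relative abundances so that they sum to the total protein quantification.

    total_protein: total protein quantification for the sample (e.g., mg/mL)
    abundances:    mapping of protein name -> relative abundance (arbitrary units,
                   assumed here to be proportional to mass)
    """
    total_abundance = sum(abundances.values())
    if total_abundance <= 0:
        raise ValueError("abundances must sum to a positive value")
    return {
        name: total_protein * abundance / total_abundance
        for name, abundance in abundances.items()
    }


# Hypothetical measurements for a sample containing three proteins.
concentrations = protein_concentrations(
    10.0, {"protein_1": 60.0, "protein_2": 30.0, "protein_3": 10.0}
)
```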

The protein concentrations (e.g., mol %, wt %) can be used to identify the most abundant proteins in the set, to weight variables (e.g., features), in downstream analyses to determine proteins that have a disproportionate effect on functional properties relative to their concentration, and/or otherwise used. For any part of the method, a protein set and/or data associated with a protein set can be adjusted based on the protein composition. In a first example, a subset of the protein set is determined (e.g., to represent the complete protein set), wherein the subset includes the highest prevalence proteins in the set. In a first specific example, proteins that occupy a proportion of the protein set above a threshold percentage are selected as the subset, wherein the threshold percentage can be between 0.5%-50% or any range or value therebetween (e.g., 1%, 2%, 5%, 10%, 15%, 20%, 25%, 50%, etc.), but can alternatively be less than 0.5% or greater than 50%. In a second specific example, proteins with an overall concentration in the sample above a threshold percentage (e.g., mol %, wt %) are selected as the subset, wherein the threshold percentage can be between 0.05%-20% or any range or value therebetween (e.g., 0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%, 15%, etc.), but can alternatively be less than 0.05% or greater than 20%. In a third specific example, a threshold number of the highest prevalence proteins are selected for the subset, wherein the threshold number can be between 1-100 or any range or value therebetween (e.g., 2-10, 5, 10, 15, etc.), but can alternatively be greater than 100. In a fourth specific example, proteins having a certain set of characteristics (e.g., binding affinity, adsorption affinity, etc.) that enable easy extraction and/or purification can be selected as the subset. In a second example, data associated with a protein set can be weighted based on the proportions of each constituent protein.
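The threshold-percentage and threshold-number selections described above can be sketched as follows. This is a minimal illustration in which the composition is assumed to be expressed as fractions that sum to one; the function names and values are hypothetical.

```python
def subset_by_threshold(composition, threshold_fraction):
    """Keep proteins whose proportion of the protein set exceeds the threshold."""
    return {p: f for p, f in composition.items() if f > threshold_fraction}


def subset_top_n(composition, n):
    """Keep the n highest prevalence proteins in the protein set."""
    ranked = sorted(composition.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n])


# Hypothetical composition, expressed as fractions of the protein set.
composition = {
    "protein_1": 0.55,
    "protein_2": 0.30,
    "protein_3": 0.10,
    "protein_4": 0.05,
}

above_five_percent = subset_by_threshold(composition, 0.05)  # strictly above 5%
top_two = subset_top_n(composition, 2)
```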

However, the protein set composition can be otherwise determined.

Determining sequences for the protein set S140 functions to determine information for feature extraction (e.g., for sequence-based features) and/or to directly determine feature values (e.g., where the feature values are sequences). Sequences can be measured (e.g., using an assay), retrieved (e.g., from a third-party database), and/or otherwise determined. Determining sequences can optionally include determining secondary information associated with the sequences (e.g., protein structure information, metadata, etc.). In a first variant, a sequence is preferably determined for each individual protein in the protein set (e.g., retrieved from a database, determined using protein sequencing, etc.), but can alternatively be determined for a subset of proteins in the protein set, be determined directly for the protein set as a whole, and/or be otherwise determined. However, sequences for the protein set can be otherwise determined.

Determining feature values for the protein set S160 functions to computationally identify characterization values (e.g., molecular property values) of the protein set. S160 can be performed one or more times for each protein in a protein set, one or more times for each protein in a subset of the protein set (e.g., for the highest prevalence proteins in the protein set), one or more times for each protein set (e.g., iterating through a database), after S200 (e.g., where feature values for a protein set are determined for the features selected in S200), and/or at any other suitable time. The feature values can optionally be a feature value vector (e.g., wherein each element of the vector is a feature value for a feature in a feature set).

Feature values are preferably determined using the feature extraction model, but can alternatively be otherwise determined. In a first variant, feature values can be computationally determined. In a first example, feature values can be extracted from sequences (e.g., amino acid sequences). In a second example, feature values can be based on a computationally-determined protein charge and/or charge distribution. In a third example, feature values can be determined based on a modeled folding pattern (e.g., a likely protein folding pattern). In a fourth example, context feature values can be determined based on context information (e.g., extracted from ingredient lists, treatments, protein modifications, etc.) and optionally the protein sequences. In a second variant, feature values can be measured and/or extracted from measurements (e.g., experimentally determined using assays). In a third variant, feature values can be determined using a simulation (e.g., protein folding simulation, protein functionality simulation, protein interaction simulation, etc.). In a fourth variant, feature values can be retrieved from a database (e.g., a third-party database, the system database, etc.). In a fifth variant, a first subset of feature values can be determined using a first feature extraction model while the remaining feature values are determined using a second feature extraction model (e.g., using values from the first feature subset, using other information, etc.).

Feature values can be determined using one or more of the variants. In a first example, feature values can be computationally determined and subsequently validated and/or updated using measurements (e.g., values for water binding capacity can be estimated based on computationally determined charge distribution and/or folding pattern, then subsequently tested using centrifugal compression). In a second example, the amino acid sequence for each protein of the protein set (e.g., for a subset of the protein set including the highest prevalence proteins) can be retrieved from a third-party database, then feature values can be subsequently extracted based on the retrieved sequences. In a third example, context feature values can be determined based on context information retrieved from the database, and sequence feature values can be determined using a feature extraction model.

S160 can optionally include aggregating feature values across individual proteins in the protein set (e.g., all proteins in the set, a subset of the protein set, etc.). For example, an aggregated feature value set (e.g., aggregated feature value vector) can be determined for the protein set based on feature value sets (e.g., feature value vectors) for one or more proteins in the protein set. The feature values are preferably aggregated using the feature aggregation model, but can alternatively be otherwise aggregated. In a first example, aggregating feature values includes summing the values for each feature across the proteins of the protein set (e.g., optionally weighted by concentration or abundance). In a second example, aggregating feature values includes predicting an aggregated feature vector based on the feature value set for each protein of the protein set and optionally the respective protein concentration or abundance (e.g., wherein the feature value sets can be concatenated, fed to different input heads, etc.). In a third example, aggregating feature values can include predicting the aggregated feature vector based on the protein sequences of the proteins within the protein set (e.g., wherein the protein sequences can be concatenated, fed to different input heads, etc.).
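The first aggregation example (summing the values for each feature across proteins, optionally weighted by concentration) can be sketched as follows. The function name, protein names, and values are hypothetical, and the weights are assumed to be fractional concentrations.

```python
def aggregate_features(feature_vectors, weights=None):
    """Sum per-protein feature value vectors, optionally weighted per protein.

    feature_vectors: mapping of protein name -> feature value vector
    weights:         optional mapping of protein name -> weight (e.g., concentration);
                     an unweighted sum is used when omitted
    """
    names = list(feature_vectors)
    if not names:
        raise ValueError("at least one protein is required")
    if weights is None:
        weights = {name: 1.0 for name in names}
    length = len(feature_vectors[names[0]])
    aggregated = [0.0] * length
    for name in names:
        vector = feature_vectors[name]
        if len(vector) != length:
            raise ValueError("all feature vectors must have the same length")
        for i, value in enumerate(vector):
            aggregated[i] += weights[name] * value
    return aggregated


# Hypothetical two-feature vectors for a protein set with two proteins.
features = {"protein_1": [1.0, 4.0], "protein_2": [3.0, 0.0]}
fractions = {"protein_1": 0.75, "protein_2": 0.25}
aggregate = aggregate_features(features, fractions)
```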

However, feature values can be otherwise determined.

Determining functional property values for the protein set S180 functions to determine behavior of the protein set. S180 can be performed before S160, after S160, during S600, iteratively, and/or at any other suitable time.

The functional property values are preferably measured and/or otherwise directly determined values for a set of functional properties, but can alternatively be manually assigned, inferred, predicted, or otherwise determined. In a first variant, functional property values are measured and/or extracted from measurements (e.g., measurements determined using any assay and/or assay tool). For example, a sample can be manufactured using the protein set, wherein the functional property values for the protein set are measured using one or more assays and/or assay tools. In an illustrative example, the functional property values are determined for a protein and lipid gel (e.g., wherein the gel manufacturing is prescribed by a context associated with the protein set). The functional property values can be determined using one or more experimental environments, treatments, and/or any other variable (e.g., where the set of functional property values determined in an environment are associated with that variable). In a second variant, the functional property values can be retrieved from a database (e.g., a third-party database). In a third variant, the functional property values can be computationally determined. In a first example of the third variant, the functional property values can be determined based on simulations (e.g., computer simulations of protein dynamics). In a second example of the third variant, the functional property values can be predicted using the prediction model (e.g., based on the protein set feature values, etc.).

However, functional property values can be otherwise determined.

The method can optionally include selecting a feature subset S200, which functions to select features which most likely influence (e.g., have a measurable effect on, a significant effect on, a disproportionate effect relative to their concentration, etc.) one or more functional properties and/or to reduce feature space dimensions (e.g., to reduce computational load). S200 can be performed after S100, before S400, during and/or after S300, and/or at any other suitable time. The feature subset can be selected using a feature selection model, using a correlation model, randomly, with human input, and/or be otherwise determined.

In a first variant, the feature subset can be features (e.g., target features) that influence functional properties. In a first embodiment, the feature selection model uses lift analysis (e.g., applied to a prediction model trained to output functional property values based on the feature values) to select the subset of features with lift above a threshold. In a second embodiment, features with prediction model weights above a threshold value are selected as the feature subset, wherein the model weights can be determined during and/or after prediction model training. In a third embodiment, a correlation model can be used to determine features positively and/or negatively correlated to one or more functional properties (e.g., absolute value of correlation coefficient above a threshold, a confidence score above a threshold, etc.).
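The third embodiment can be illustrated with a simple Pearson correlation filter, in which a feature is selected when the absolute value of its correlation coefficient with a functional property exceeds a threshold. The choice of Pearson correlation, the feature names, and the values are assumptions of this sketch; the correlation model can use any other method described above.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def select_features(feature_columns, property_values, threshold):
    """Return {feature: coefficient} for features with |coefficient| above threshold."""
    selected = {}
    for name, column in feature_columns.items():
        r = pearson(column, property_values)
        if abs(r) > threshold:
            selected[name] = r
    return selected


# Hypothetical feature values across four characterized protein sets.
feature_columns = {
    "feature_a": [1.0, 2.0, 3.0, 4.0],  # positively correlated
    "feature_b": [4.0, 3.0, 2.0, 1.0],  # negatively correlated
    "feature_c": [1.0, 3.0, 1.0, 3.0],  # weakly correlated
}
meltability = [2.0, 4.0, 6.0, 8.0]  # hypothetical measured property values
feature_subset = select_features(feature_columns, meltability, threshold=0.8)
```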

In a second variant, the subset of features can be determined using any dimensionality reduction technique (e.g., principal component analysis, linear discriminant analysis, etc.).

In a third variant, the subset of features can be determined based on a comparison between a target (e.g., a target protein set and/or target product) and a candidate protein set (e.g., a prototype protein set), wherein the subset of features (e.g., used to predict functional properties for a second candidate protein set) can be selected based on the similarities and/or differences between the respective functional property values. In a first example, a difference between functional property values associated with the target and candidate protein set can be determined (e.g., where values for one or more functional properties differ significantly between the target and candidate). The differing functional property values can define a functional property subset (e.g., target functional properties). These target functional properties can then be used to determine a feature subset (e.g., target features), wherein the feature subset can be the feature(s) most likely to influence the functional property subset (e.g., based on a correlation model output). In a second example, a difference between feature values associated with the target and candidate protein set can be determined (e.g., where one or more functional property values differ between the two sets). The features associated with the differing feature values can define the feature subset (e.g., target features).

However, the feature subset can be otherwise selected.

The method can optionally include selecting a functional property subset S250, which functions to reduce functional property space dimensions (e.g., to reduce computational load). S250 can be performed after S100, before S400, during and/or after S300, and/or at any other suitable time. The functional property subset can be selected using a feature selection model, using a correlation model, randomly, with human input, and/or be otherwise determined.

In a first variant, the subset of functional properties can be determined using any dimensionality reduction technique (e.g., principal component analysis, linear discriminant analysis, etc.).

In a second variant, the subset of functional properties can be determined based on a comparison between a target (e.g., a target protein set and/or target product) and a first candidate protein set (e.g., a prototype protein set), wherein the subset of functional properties can be selected based on the similarities and/or differences between the functional property values for the target and the first protein set. In an example, a difference between functional property values associated with the target and candidate protein set can be determined (e.g., where values for one or more functional properties differ significantly between the two sets). The differing functional property values can define a functional property subset (e.g., target functional properties).
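As a non-limiting illustration of this comparison, the sketch below selects the functional properties whose values differ between a target and a prototype protein set by more than a relative tolerance. Treating a "significant" difference as a relative difference above 20%, as well as the property names and values, are assumptions made only for this example.

```python
def differing_properties(target, candidate, rel_tol=0.2):
    """Return property names whose values differ by more than rel_tol (relative to the target)."""
    subset = []
    for name, target_value in target.items():
        scale = max(abs(target_value), 1e-9)  # guard against division by zero
        if abs(target_value - candidate[name]) / scale > rel_tol:
            subset.append(name)
    return subset


# Hypothetical functional property values.
target = {"gel_strength": 4.0, "meltability": 0.8, "solubility": 0.9}
prototype = {"gel_strength": 2.0, "meltability": 0.75, "solubility": 0.5}

target_functional_properties = differing_properties(target, prototype)
print(target_functional_properties)
```

The resulting subset can then be passed to a correlation model to find the features most likely to influence those properties.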

However, the functional property subset can be otherwise selected.

Training a prediction model S300 functions to improve functional property value prediction, candidate protein set determination (using the prediction model), and/or any other part of the method. S300 can be performed after S100 and/or at any other time.

In variants, training the prediction model includes determining training data including feature values (e.g., determined via S160) and corresponding functional property values (e.g., determined via S180) for one or more protein sets (e.g., a set of protein sets). The functional property values in the training data are preferably measured, but can alternatively be otherwise determined (e.g., using any other method in S180). The prediction model is then trained using the training data to predict the functional property values for a protein set based on the feature values for the protein set. Examples are shown in FIG. 6, FIG. 7A, and FIG. 7B.
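As a non-limiting illustration, the sketch below pairs feature vectors with measured functional property vectors and fits a simple predictor. A k-nearest-neighbor regressor is used here purely as a stand-in for the prediction model, which can be any suitable model; the class name, the data values, and k=2 are hypothetical.

```python
from math import dist


class NearestNeighborPredictor:
    """Stand-in prediction model: averages the properties of the k closest training sets."""

    def __init__(self, k=1):
        self.k = k
        self._features = []
        self._properties = []

    def fit(self, feature_rows, property_rows):
        # "Training" for this model amounts to storing the labeled examples.
        self._features = [list(row) for row in feature_rows]
        self._properties = [list(row) for row in property_rows]
        return self

    def predict(self, features):
        ranked = sorted(
            zip(self._features, self._properties),
            key=lambda pair: dist(pair[0], features),
        )
        nearest = [props for _, props in ranked[: self.k]]
        n_props = len(nearest[0])
        return [sum(p[j] for p in nearest) / len(nearest) for j in range(n_props)]


# Hypothetical training data: feature values (S160) and measured property values (S180).
train_features = [[0.1, 0.2], [0.8, 0.9], [0.2, 0.1]]
train_properties = [[1.0, 0.5], [4.0, 0.9], [1.2, 0.4]]

model = NearestNeighborPredictor(k=2).fit(train_features, train_properties)
predicted = model.predict([0.15, 0.15])
print(predicted)
```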

In any variant, the training data can include positive samples (e.g., with no negative samples), wherein the prediction model is trained using positive-unlabeled learning. Alternatively or additionally, the training data can include negative samples, wherein the prediction model can be trained to distance the prediction from the negative samples.

However, one or more prediction models can be otherwise trained.

Determining target characteristic values S400 functions to specify one or more criteria for candidate protein set determination. For example, the candidate protein set can be selected to manufacture an analog for a target product, to replace a target protein set (e.g., a protein set to be replicated and/or replaced, a protein set to be replicated with specified modifications, etc.), to meet a desired set of characteristic values, and/or otherwise used. S400 can be performed after S100 (e.g., after a target protein set has been characterized) and/or at any other time.

The target characteristic values are preferably associated with a characterized protein set (e.g., a characterized target protein set), but alternatively can be associated with an uncharacterized protein set, be associated with a source and/or source component, be associated with a target product (e.g., target food product), be otherwise associated with protein set information, and/or not be associated with a protein set and/or source. The target characteristic values can be all or a subset of: the functional property values, the feature values, the amino acid sequences, and/or any other characteristic value associated with a target: product, source, source component, and/or protein set.

The target characteristic values (e.g., a target characteristic value vector) can be determined manually, automatically, predetermined, with a model (e.g., target features selected using a feature selection model, target functional properties selected using a functional property selection model, etc.), based on a target product and/or target protein set, based on a use case (e.g., the use case for the candidate protein set, for the associated target protein set, etc.), retrieved from a database (e.g., where target functional property values are those associated with a target protein set in the database), measured, and/or be otherwise determined.

In a first variant, the target characteristic values include target feature values. In a first embodiment, the target feature values can be determined for a target protein set using S160 methods. In a specific example, a subset of feature values of the target protein set can be used as the target characteristic values, where the subset can correspond to the feature subset determined in S200. In a second embodiment, target functional property values are used to determine target feature values. In a specific example, a correlation model is used to identify feature values associated with the target functional property values.

In a second variant, the target characteristic values include target functional property values. In a first embodiment, the target functional property values can be determined for a target product and/or protein set using S180 methods. In a second embodiment, the target functional property values can be manually specified (e.g., desired or optimal functional property values for a product, a desired change in functional property values relative to functional property values for a protein set, etc.).

In a third variant, the target characteristic values can include target feature values and target functional property values (e.g., a combination of the first and second variants).

However, target characteristic values can be otherwise determined.

Determining a candidate protein set based on the target characteristic values S500 functions to determine a protein set that satisfies target criteria (e.g., has desired characteristic values, mimics a target product/protein set, etc.). Additionally or alternatively, S500 functions to determine a candidate protein set for evaluation in S600 (e.g., wherein characterization of the candidate protein set can train the prediction model). S500 can be performed after S400, after S300, during S300 (e.g., as part of training), and/or at any other suitable time.

Determining the candidate protein set can optionally include determining the composition of the candidate protein set (e.g., determining each protein in the set and/or determining the concentration of each protein in the set) and/or selecting a context for the candidate protein set.

In a first variant, determining each protein in the candidate protein set includes individually selecting each individual protein in the candidate protein set from proteins in a candidate group of protein sets. In a second variant, determining each protein in the candidate protein set includes selecting the candidate protein set as a whole from the candidate group of protein sets.

The candidate group of protein sets can include uncharacterized protein sets, partially characterized protein sets (e.g., with feature values but not functional property values), fully characterized protein sets (e.g., with both feature values and functional property values), known or estimated abundant protein sets (e.g., determined based on functional protein labelling), and/or any other set of protein sets. The candidate group can optionally be a subset of the system database (e.g., to reduce the computational resources, to reduce the search space, to constrain all or parts of the selection, etc.). For example, the candidate group can include a subset of protein sources (e.g., candidate protein sources, wherein all or parts of the protein sets associated with each candidate protein source are included), a subset of protein sets, and/or any other subset. In a first specific example, an evolutionary tree is used to identify protein sources evolutionarily related to a target protein source, wherein the candidate group includes protein sets associated with the identified protein sources. In a second specific example, the candidate group includes a set of protein sets with target feature values (e.g., within a threshold similarity to target feature values).

Each protein set in the candidate group can optionally be associated with one or more concentrations and/or contexts. For example, each protein set can be associated with a predetermined set of possible values for each concentration and context parameter (e.g., a protein set can be associated with each unique combination of possible compositions and context values). In an illustrative example, a protein set in the candidate group includes [Protein 1, Protein 2]; the possible compositions for the protein set include: [70%, 30%], [30%, 70%], and [50%, 50%]; the possible contexts for the protein set include: [combine with canola oil, heat to 65° C., glycosylation of Protein 1], [combine with kokum butter, heat to 65° C., glycosylation of Protein 1], [combine with canola oil, heat to 72° C., glycosylation of Protein 1], [combine with kokum butter, heat to 72° C., glycosylation of Protein 1], [combine with canola oil, heat to 65° C., no glycosylation of Protein 1], [combine with kokum butter, heat to 65° C., no glycosylation of Protein 1], [combine with canola oil, heat to 72° C., no glycosylation of Protein 1], and [combine with kokum butter, heat to 72° C., no glycosylation of Protein 1].
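The illustrative example above can be enumerated programmatically. The sketch below, offered only as one possible representation, builds every unique composition and context pair for the [Protein 1, Protein 2] set; the dictionary keys are hypothetical names introduced for this example.

```python
from itertools import product

# Values taken from the illustrative example above.
compositions = [(70, 30), (30, 70), (50, 50)]  # percent Protein 1, percent Protein 2
fats = ["canola oil", "kokum butter"]
temperatures_c = [65, 72]
glycosylate_protein_1 = [True, False]

# Each context is one combination of fat, heating temperature, and glycosylation.
contexts = list(product(fats, temperatures_c, glycosylate_protein_1))

# Each variant pairs one composition with one context (key names are hypothetical).
variants = [
    {
        "proteins": ("Protein 1", "Protein 2"),
        "composition": composition,
        "fat": fat,
        "heat_c": heat_c,
        "glycosylate_protein_1": glycosylate,
    }
    for composition, (fat, heat_c, glycosylate) in product(compositions, contexts)
]

print(len(contexts), len(variants))
```

With three compositions and eight contexts, the single protein set expands to 24 candidate entries, which shows how quickly the search space grows as context parameters are added.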

The candidate protein set can be determined based on: the target and candidate protein set's characteristic values (e.g., functional property values, feature values, etc.), estimated abundance and/or ease of extraction (e.g., determined based on the protein set's functionality, the protein source, the source component, etc.), the database, and/or any other factor. Any candidate protein set determination method can optionally be supplemented based on protein source and/or source component information (e.g., where the probability of selecting a protein set as the candidate protein set increases if the protein set is likely to be abundant within the protein source and/or the protein source itself is likely to be abundant relative to a threshold).

In a first variant, the candidate protein set is determined using optimization approaches (e.g., Bayesian optimization, machine learning recommender systems, etc.). For example, the candidate protein set can be selected as a training protein set for characterization (e.g., to expand the training data for use in S300), wherein optimization approaches can be used to reduce (e.g., minimize) the number of additional training protein sets that are needed to train the prediction model and/or to identify a candidate protein set that satisfies the target criteria.

In a second variant, the candidate protein set can be determined by comparing (e.g., matching) one or more characteristic values using a similarity model to generate a comparison metric (e.g., example shown in FIG. 9). For example, characteristic values for each protein set in the candidate group (e.g., for each unique protein set composition and context pair) can be predicted (e.g., using the prediction model), wherein the predicted characteristic values are compared to the target characteristic values to generate the comparison metric. The candidate protein set (e.g., with associated composition and context) can then be determined based on the comparison metric (e.g., selecting the protein set with the minimum or maximum comparison metric, selecting a protein set with a comparison metric above or below a threshold, selecting the protein set using a protein set determination model, etc.).
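As a non-limiting illustration of this variant, the sketch below scores each entry in a candidate group by the Euclidean distance between its predicted characteristic values and the target characteristic values, then selects the entry with the minimum distance. Euclidean distance is only one possible comparison metric; the candidate names and all values are hypothetical, and the predicted values are assumed to have already been produced by a prediction model.

```python
from math import dist


def select_candidate(predicted_by_candidate, target_values):
    """Return the candidate with the smallest distance to the target, plus all scores."""
    scores = {
        name: dist(values, target_values)
        for name, values in predicted_by_candidate.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical target characteristic values, e.g., [gel_strength, meltability].
target_values = [4.0, 0.8]

# Hypothetical predicted values for each composition and context pair.
predicted_by_candidate = {
    "set_A_70_30_canola_65C": [3.0, 0.8],
    "set_A_50_50_kokum_72C": [3.9, 0.7],
    "set_B_30_70_canola_72C": [1.0, 0.2],
}

best, scores = select_candidate(predicted_by_candidate, target_values)
print(best, round(scores[best], 4))
```

In practice the characteristic values would typically be normalized before computing a distance so that no single property dominates the metric.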

In a first embodiment, the candidate protein set's predicted functional property values can be compared to target functional property values (e.g., for an analogous set of functional properties). For example, the candidate protein set's functional property values are predicted using the prediction model (e.g., based on feature values, based on context, etc.).

In a second embodiment, the candidate protein set's feature values can be compared to target feature values. For example, a match between positive target feature values and candidate protein set feature values can increase the probability of selection of the candidate protein set, whereas a match between negative target feature values and candidate protein set feature values can decrease the probability of selection.
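As a non-limiting illustration of this embodiment, the sketch below raises a candidate's score for each match with a positive target feature value and lowers it for each match with a negative target feature value. The unit increments, the matching tolerance, and the feature names and values are hypothetical assumptions; any mapping from matches to selection probability can be used.

```python
def match_score(candidate, positive_targets, negative_targets, tol=0.1):
    """Count positive-target matches minus negative-target matches."""

    def count_matches(targets):
        return sum(
            1
            for name, value in targets.items()
            if name in candidate and abs(candidate[name] - value) <= tol
        )

    return count_matches(positive_targets) - count_matches(negative_targets)


# Hypothetical target feature values.
positive_targets = {"hydrophobicity": 0.5}  # desired feature value
negative_targets = {"net_charge": -4.0}     # feature value to avoid

candidate_a = {"hydrophobicity": 0.52, "net_charge": -2.0}
candidate_b = {"hydrophobicity": 0.90, "net_charge": -4.05}

score_a = match_score(candidate_a, positive_targets, negative_targets)
score_b = match_score(candidate_b, positive_targets, negative_targets)
print(score_a, score_b)
```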

In a third embodiment, a candidate protein source and/or candidate protein set can be selected based on an evolutionary tree. In a first example, the candidate protein source is selected by identifying a protein source based on a close evolutionary relationship with a target protein source and/or protein source containing a matching candidate protein set. In a second example, the candidate protein set can be selected by identifying close evolutionary relationships between proteins in the candidate protein set and proteins in a target protein set. In a third example, additional candidate protein set(s) can be selected after a first selection event of a first candidate protein set by identifying additional protein set(s) based on close evolutionary relationships to the first candidate protein set.

The candidate protein set can optionally be used to manufacture an analog for a target food product (e.g., dairy analog, meat analog, egg analog, any animal product analog, etc.) and/or any other sample (e.g., product). For example, a protein source associated with the candidate protein set can be selected as an ingredient for manufacturing a product. In a specific example, proteins in the candidate protein set can be extracted and/or isolated from one or more sources, wherein a sample is manufactured (e.g., based on a context associated with the candidate protein set) using the proteins to have the determined candidate protein set composition.

However, the candidate protein set can be otherwise determined.

The method can optionally include evaluating the candidate protein set S600, which functions to determine whether the candidate protein set can be used in an analog for a target product, whether the candidate protein set can be used as a replacement for a target protein set, whether the candidate protein set has the desired (e.g., target) characteristic values, to determine feedback for a model (e.g., for training the prediction model, the protein set determination model, and/or any other model), and/or to compare the functional property values of the candidate protein set to one or more other functional property values.

S600 can be performed after S500, after S180 (e.g., after the candidate protein set is characterized with functional property values), iteratively (e.g., until a stop condition is met, such as substantial similarity to the target), and/or at any other time. A search for a protein set can be continued (e.g., iteratively performing S500 and S600) until a candidate protein set satisfies a set of target criteria (e.g., stopping when the evaluation indicates that the candidate protein set characteristic values fall within target ranges), until a comparison metric is below or above a threshold, for a predetermined number of iterations, and/or until any other stop condition is met. In an example, the target criteria include one or more ranges of characteristic values based on target characteristic values (e.g., predetermined ranges around the target characteristic values).
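As a non-limiting illustration of the iterative search, the sketch below evaluates candidates in order and stops either when all characteristic values fall within target ranges or when a predetermined number of iterations is reached. The function name, candidate labels, measured values, and ranges are hypothetical; `evaluate` stands in for S600 and is assumed to return measured characteristic values.

```python
from itertools import islice


def search(candidates, evaluate, target_ranges, max_iterations):
    """Return (first candidate within all target ranges, iterations used), or (None, iterations)."""
    iterations = 0
    for candidate in islice(candidates, max_iterations):
        iterations += 1
        measured = evaluate(candidate)  # stands in for S600
        in_range = all(
            low <= measured[name] <= high
            for name, (low, high) in target_ranges.items()
        )
        if in_range:
            return candidate, iterations  # stop condition: target criteria satisfied
    return None, iterations  # stop condition: iteration budget or candidates exhausted


# Hypothetical measured values for three candidate protein sets.
measurements = {
    "A": {"gel_strength": 2.0, "meltability": 0.70},
    "B": {"gel_strength": 3.9, "meltability": 0.78},
    "C": {"gel_strength": 4.1, "meltability": 0.80},
}
# Hypothetical predetermined ranges around target values of 4.0 and 0.8.
target_ranges = {"gel_strength": (3.6, 4.4), "meltability": (0.7, 0.9)}

found, used = search(["A", "B", "C"], measurements.get, target_ranges, max_iterations=5)
print(found, used)
```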

S600 can include: determining functional property values for the candidate protein set (e.g., S180 performed for the candidate protein set), and determining a comparison metric based on the resultant functional property values (e.g., using the similarity model). In an example, determining functional property values for the candidate protein set includes manufacturing a sample containing the candidate protein set (e.g., at a protein composition determined in S500 and/or using a context determined in S500), wherein the sample is subjected to assays to measure functional property values.

The sample (e.g., target food replica) can be manufactured by mixing the protein set with a set of other ingredients (e.g., plant-derived ingredients, such as fats, oils, sugars, etc.) and processing the mixture (e.g., by heating, reacting, inoculating, fermenting, etc.). Alternatively, the sample can be manufactured by gelling the protein, then using the gel as an ingredient. The manufactured samples can be entirely or mostly plant-derived (e.g., more than 70%, 80%, 90%, 99%, etc. plant-derived components by weight or volume).

In a first embodiment, the comparison metric can be based on a comparison between the candidate protein set's measured functional property values and predicted functional property values (e.g., predicted functional property values for the candidate protein set determined using the prediction model). In a second embodiment, the comparison metric can be based on a comparison between the candidate protein set's measured functional property values and target functional property values (e.g., the functional property values of a target protein set). In variants, a comparison metric above or below a threshold (e.g., a significant difference between the actual and target and/or predicted functional property values) corresponds to negative feedback in model training (e.g., S400 and/or any other model training).
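As a non-limiting illustration of both embodiments, the sketch below computes a mean absolute difference between measured functional property values and a reference (either predicted or target values) and flags negative feedback when the metric exceeds a threshold. The choice of metric, the 0.2 threshold, and all values are hypothetical assumptions.

```python
def comparison_metric(measured, reference):
    """Mean absolute difference between measured and reference property values."""
    return sum(abs(m - r) for m, r in zip(measured, reference)) / len(measured)


def is_negative_feedback(measured, reference, threshold=0.2):
    """Flag negative feedback when measured values deviate too far from the reference."""
    return comparison_metric(measured, reference) > threshold


# Hypothetical functional property values for one candidate protein set.
measured = [3.0, 0.6]
predicted = [3.9, 0.7]  # first embodiment: output of the prediction model
target = [3.1, 0.65]    # second embodiment: values of the target protein set

feedback_vs_predicted = is_negative_feedback(measured, predicted)
feedback_vs_target = is_negative_feedback(measured, target)
print(feedback_vs_predicted, feedback_vs_target)
```

In this example the candidate is close to the target but far from its own prediction, so only the comparison against the predicted values produces negative feedback for the model.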

However, the candidate protein set can be otherwise evaluated.

The method can optionally include determining interpretability and/or explainability of the trained prediction model, which can be used to select features, select functional properties, identify errors in the data, identify ways of improving the prediction model, increase computational efficiency, determine influential features and/or values thereof, determine influential functional properties and/or values thereof, and/or otherwise used. Interpretability and/or explainability methods can include: local interpretable model-agnostic explanations (LIME), Shapley Additive explanations (SHAP), Anchors, DeepLift, Layer-Wise Relevance Propagation, contrastive explanations method (CEM), counterfactual explanation, Protodash, Permutation importance (PIMP), L2X, partial dependence plots (PDPs), individual conditional expectation (ICE) plots, accumulated local effect (ALE) plots, Local Interpretable Visual Explanations (LIVE), breakDown, ProfWeight, Supersparse Linear Integer Models (SLIM), generalized additive models with pairwise interactions (GA2Ms), Boolean Rule Column Generation, Generalized Linear Rule Models, Teaching Explanations for Decisions (TED), and/or any other suitable method and/or approach.
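As a non-limiting illustration of one listed method, the sketch below estimates permutation importance: each feature column is permuted in turn, and the resulting increase in mean squared error indicates how much the model relies on that feature. For reproducibility this sketch uses a fixed cyclic shift as the permutation rather than random shuffles, which is a simplification; the model and data are hypothetical.

```python
def mean_squared_error(model, rows, targets):
    return sum((model(row) - t) ** 2 for row, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets):
    """Increase in error after permuting each feature column (cyclic shift by one)."""
    baseline = mean_squared_error(model, rows, targets)
    importances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        shifted = column[-1:] + column[:-1]  # deterministic permutation of the column
        permuted_rows = [
            row[:j] + [value] + row[j + 1:] for row, value in zip(rows, shifted)
        ]
        importances.append(mean_squared_error(model, permuted_rows, targets) - baseline)
    return importances


def toy_model(row):
    """Hypothetical trained model that depends only on the first feature."""
    return 2 * row[0]


rows = [[1, 5], [2, 3], [3, 8], [4, 1]]
targets = [2, 4, 6, 8]

importances = permutation_importance(toy_model, rows, targets)
print(importances)
```

A feature whose permutation leaves the error unchanged, such as the second feature here, is a candidate for removal from the feature subset.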

Alternative embodiments implement the above methods and/or processing modules in non-transitory computer-readable media, storing computer-readable instructions that, when executed by a processing system, cause the processing system to perform the method(s) discussed herein. The instructions can be executed by computer-executable components integrated with the computer-readable medium and/or processing system. The computer-readable medium may include any suitable computer readable media such as RAMs, ROMs, flash memory, EEPROMs, optical devices (CD or DVD), hard drives, floppy drives, non-transitory computer readable media, or any suitable device. The computer-executable component can include a computing system and/or processing system (e.g., including one or more collocated or distributed, remote or local processors) connected to the non-transitory computer-readable medium, such as CPUs, GPUs, TPUs, microprocessors, or ASICs, but the instructions can alternatively or additionally be executed by any suitable dedicated hardware device.

Embodiments of the system and/or method can include every combination and permutation of the various system components and the various method processes, wherein one or more instances of the method and/or processes described herein can be performed asynchronously (e.g., sequentially), contemporaneously (e.g., concurrently, in parallel, etc.), or in any other suitable order by and/or using one or more instances of the systems, elements, and/or entities described herein. Components and/or processes of the following system and/or method can be used with, in addition to, in lieu of, or otherwise integrated with all or a portion of the systems and/or methods disclosed in the applications mentioned above, each of which is incorporated in its entirety by this reference.

As a person skilled in the art will recognize from the previous detailed description and from the figures and claims, modifications and changes can be made to the preferred embodiments of the invention without departing from the scope of this invention defined in the following claims.

1. A composition comprising: plant derived proteins covalently bonded to a sugar, a base, and a cheese-starter culture, wherein said plant derived proteins are not soy derived, and wherein said composition is free of animal proteins.
2. The composition of claim 1, wherein the sugar is glucose.
3. The composition of claim 1, wherein the base is a sodium hydroxide.
4. The composition of claim 1, further comprising oils or fats isolated from plant sources.
5. The composition of claim 1, wherein the covalent bonds are formed between a sugar and a lysine side chain of a protein constituent.
6. The composition of claim 1, wherein the cheese-starter culture is selected from the group consisting of: a Penicillium camemberti, Penicillium candidum, Geotrichum candidum, Penicillium roqueforti, Penicillium nalgiovensis, Verticillium lecanii, Kluyveromyces lactis, Saccharomyces cerevisiae, Candida utilis, Debaryomyces hansenii, Rhodosporidum infirmominiatum, Candida jefer, Cornybacteria, Micrococcus sps., Lactobacillus sps., Lactococcus, Staphylococcus, Halomonas, Brevibacterium, Psychrobacter, Leuconostocaceae, Streptococcus thermophilus, Pediococcus sps., Propionibacteria culture, and combinations thereof.
7. A composition comprising: a non-dairy milk having at least 80% of its insoluble solids removed, a base, and a sugar, wherein said non-dairy milk is selected from the group consisting of hemp milk, sesame milk, pumpkin milk, almond milk, cashew milk, brazilnut milk, chestnut milk, coconut milk, hazelnut milk, macadamia nut milk, pecan milk, pistachio milk, walnut milk, and combinations thereof.
8. The composition of claim 7, wherein the sugar is glucose.
9. The composition of claim 7, wherein the base is a sodium hydroxide.
10. The composition of claim 7, wherein the composition is free of animal proteins.
11. The composition of claim 7, wherein a single monomeric or multimeric protein represents at least 80% of the protein content of the composition.
12. A cheese replica comprising: a gelled emulsion of one or more glycated proteins derived from plants, one or more fats, a base, and a cheese-starter culture, wherein said proteins are not soy derived, and wherein said cheese replica is free of animal proteins.
13. The cheese replica of claim 12, wherein the glycated proteins comprise proteins covalently bonded to glucose.
14. The cheese replica of claim 12, wherein the base is a sodium hydroxide.
15. The cheese replica of claim 12, wherein the one or more plant-derived proteins and one or more fats are from nuts, legumes other than soybeans, or seeds.
16. The cheese replica of claim 15, wherein the nuts comprise almonds, cashews, brazilnuts, coconuts, chestnuts, hazelnuts, macadamia nuts, pecans, pistachios, walnuts, or combinations thereof.
17. The cheese replica of claim 15, wherein the seeds comprise hemp, sesame, pumpkin, watermelon, or combinations thereof.
18. The cheese replica of claim 12, wherein the plant-derived proteins comprise one or more proteins selected from the group consisting of a globulin, a pseudoglobulin, a globular protein, a prolamin, an albumin, a gluten, a gliadin, a conglycinin, an hordein, a phasolin, a zein, an olsosin, a caloleosin, a sterelosin, a conjugated protein, a seed storage protein, and a vegetative storage protein.
19. The cheese replica of claim 12, wherein the cheese replica is formulated as a fermented cheese replica.
20. The cheese replica of claim 12, wherein the cheese replica is formulated as a soft cheese replica.